We have the problem that many early boot or late shutdown issues are harder
to solve than they could be because we have no logs. When journald is not
running, messages are redirected to /dev/kmsg. It is also the time when many
things happen in rapid succession, so we tend to hit the kernel printk
ratelimit fairly reliably. The end result is that we get no logs from the time
when they would be most useful. Thus let's disable the kernel's ratelimit.
Once the system is up and running, the ratelimit is not a problem: during
normal runtime things log to journald rather than /dev/kmsg, so the ratelimit
neither gets in the way nor does anything useful. Hence, there doesn't seem to
be much point in trying to restore it after boot is finished and journald is up
and running.
See kernel's commit 750afe7babd117daabebf4855da18e4418ea845e for the
description of the kernel interface. Our setting has lower precedence than
explicit configuration on the kernel command line.
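A minimal sketch of what flipping that knob can look like, writing "on" to the
kernel.printk_devkmsg sysctl described in the kernel commit above (error
handling simplified, function name illustrative):

#include <fcntl.h>
#include <unistd.h>

/* Disable ratelimiting of /dev/kmsg writers. If printk.devkmsg= was given
 * on the kernel command line, the kernel rejects this write, which is how
 * the lower precedence comes about. */
static int disable_kmsg_ratelimit(void) {
        int fd = open("/proc/sys/kernel/printk_devkmsg", O_WRONLY|O_CLOEXEC);
        if (fd < 0)
                return -1;

        if (write(fd, "on\n", 3) != 3) {
                close(fd);
                return -1;
        }

        return close(fd);
}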
Traditionally, user logins had a $PATH in which /bin was before /sbin, while
root logins had a $PATH with /sbin first. This allows the tricks that
consolehelper is doing to work. But even if we ignore consolehelper, having the
path in this order might have been used by admins for other purposes, and
keeping the order in user sessions will make the adoption of systemd user
sessions a bit easier.
Fixes #733.
https://bugzilla.redhat.com/show_bug.cgi?id=1744059
OOM handling in manager_default_environment wasn't really correct.
Now the (theoretical) malloc failure in strv_new() is handled.
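A minimal sketch of the corrected pattern, using the usual strv_new() and
log_oom() helpers; the surrounding names (DEFAULT_PATH, m->environment) are
illustrative, not the exact code:

        char **e = strv_new("PATH=" DEFAULT_PATH);
        if (!e)
                return log_oom();       /* logs the error and returns -ENOMEM */

        m->environment = e;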
Please note that the $PATH change has no effect on:
- systems with merged /bin-/sbin (e.g. arch)
- when there are no binaries that differ between the two locations.
E.g. on my F30 laptop there is exactly one program that is affected:
/usr/bin/setup -> consolehelper.
There is less and less stuff that relies on consolehelper, but there's still
some.
So for "clean" systems this makes no difference, but helps with legacy setups.
$ dnf repoquery --releasever=31 --qf %{name} --whatrequires usermode
anaconda-live
audit-viewer
beesu
chkrootkit
driftnet
drobo-utils-gui
hddtemp
mate-system-log
mock
pure-ftpd
setuptool
subscription-manager
system-config-httpd
system-config-rootpassword
system-switch-java
system-switch-mail
usermode-gtk
vpnc-consoleuser
wifi-radar
xawtv
When we would iterate over the lookup paths for each unit, making the list as
short as possible was important for performance. With the current cache, it
doesn't matter much. Two classes of paths were being removed:
- paths which don't exist in the filesystem
- paths which symlink to a path earlier in the search list
Both of those points cause problems with the caching code:
- if a user creates a directory that didn't exist before and puts units there,
now we will notice the new mtime and properly load the unit. When the path
was removed from the list, we wouldn't.
- we now properly detect whether a unit path is on the path or not.
Before, if e.g. /lib/systemd/system and /usr/lib/systemd/system were both on
the path, and /lib was a symlink to /usr/lib, the second directory would be
pruned from the path. Then, the code would think that a symlink
/etc/systemd/system/foo.service→/lib/systemd/system/foo.service is an alias,
but /etc/systemd/system/foo.service→/usr/lib/systemd/system/foo.service would
be considered a link (in the systemctl link sense).
Removing the pruning has a slight negative performance impact in case of
usr-merge systems which have systemd compiled with non-usr-merge paths.
Non-usr-merge systems are deprecated, and this impact should be very small, so
I think it's OK. If it turns out to be an issue, the loop in the function that
builds the cache could be improved to skip over "duplicate" directories with
the same logic that the pruning used before. I didn't want to add this,
because it complicates the code to improve a corner case.
Fixes #13272.
v2:
- do not watch mtime of transient and generated dirs
We'd reload the map after every transient unit we created, which we don't
need to do, since we create those units ourselves and know their fragment
path.
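Roughly how the mtime check works, as a sketch (the struct and names are
illustrative, not the actual code):

#include <stdbool.h>
#include <stddef.h>
#include <time.h>
#include <sys/stat.h>

/* The cache remembers the mtime of every lookup directory as observed when
 * the map was built; if any of them changed, or a previously missing
 * directory now exists, the map is rebuilt. */
typedef struct CachedDir {
        const char *path;
        struct timespec mtime;          /* zero if the dir didn't exist */
} CachedDir;

static bool cache_is_current(const CachedDir *dirs, size_t n) {
        for (size_t i = 0; i < n; i++) {
                struct stat st;
                struct timespec now = {};

                if (stat(dirs[i].path, &st) >= 0)
                        now = st.st_mtim;

                if (now.tv_sec != dirs[i].mtime.tv_sec ||
                    now.tv_nsec != dirs[i].mtime.tv_nsec)
                        return false;   /* something changed, rebuild the map */
        }

        return true;
}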
This reworks how we load units from disk. Instead of chasing symlinks every
time we are asked to load a unit by name, we slurp all symlinks from disk
and build two hashmaps:
1. from unit name to either alias target, or fragment on disk
(if an alias, we put just the target name in the hashmap, if a fragment
we put an absolute path, so we can distinguish both).
2. from a unit name to all aliases
Reading all this data can be pretty costly (40 ms on my machine), so we keep it
around for reuse.
The advantage is that we can reliably know what all the aliases of a given unit
are. This means we can reliably load dropins under all names. This fixes #11972.
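A toy illustration of map (1) and how the two kinds of values are told apart;
the table and helper names are made up for the example, the real code uses
systemd's Hashmap:

#include <stddef.h>
#include <string.h>

/* Values are either an absolute fragment path or a plain unit name (an alias
 * target); the leading '/' is what distinguishes them. */
static const char *const name_map[][2] = {
        { "dbus.service",       "/usr/lib/systemd/system/dbus.service" },
        { "messagebus.service", "dbus.service" },        /* alias */
};

static const char *map_get(const char *name) {
        for (size_t i = 0; i < sizeof(name_map) / sizeof(name_map[0]); i++)
                if (strcmp(name_map[i][0], name) == 0)
                        return name_map[i][1];
        return NULL;
}

/* Follow alias targets until an absolute path (= fragment on disk) is hit. */
static const char *resolve_fragment(const char *name) {
        const char *v;

        while ((v = map_get(name))) {
                if (v[0] == '/')
                        return v;
                name = v;
        }
        return NULL;
}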
Jobs are added to the run queue in random order. This happens because most
jobs are added by iterating over the transaction or dependency hash maps.
As a result, jobs that can be executed at the same time are started in a
different order each time.
On small embedded devices this can cause a measurable jitter for the point
in time when a job starts (~100ms jitter for 10 units that are started in
random order).
This results in a similar jitter for the boot time. This is undesirable in
general and makes optimizing the boot time a lot harder.
Also, jobs that should have a higher priority because the unit has a higher
CPU weight might get executed later than others.
Fix this by turning the job run_queue into a Prioq and sort by the
following criteria (use the next if the values are equal):
- CPU weight
- nice level
- unit type
- unit name
The last one is just there for deterministic sorting to avoid any jitter.
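A sketch of what such a comparison function can look like; the Job fields
shown are simplified stand-ins for the real unit properties:

#include <stdint.h>
#include <string.h>

typedef struct Job {
        uint64_t cpu_weight;    /* unit's effective CPU weight */
        int nice;               /* unit's nice level */
        int unit_type;          /* UnitType-style enum value */
        const char *unit_name;
} Job;

/* Smaller return value = dispatched earlier from the Prioq. */
static int job_compare(const void *a, const void *b) {
        const Job *x = a, *y = b;

        /* Higher CPU weight first */
        if (x->cpu_weight != y->cpu_weight)
                return x->cpu_weight > y->cpu_weight ? -1 : 1;

        /* Lower nice value (= higher scheduling priority) first */
        if (x->nice != y->nice)
                return x->nice < y->nice ? -1 : 1;

        /* Then by unit type */
        if (x->unit_type != y->unit_type)
                return x->unit_type < y->unit_type ? -1 : 1;

        /* Finally by name, purely to make the order deterministic */
        return strcmp(x->unit_name, y->unit_name);
}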
$ systemd-analyze dump | head -3
Timestamp firmware: (null)
Timestamp loader: (null)
Timestamp kernel: Mon 2019-07-01 17:21:02 CEST
Since this is a debugging interface, it is OK to change the output format.
The user can infer what "Timestamp firmware: 123.456ms" means.
In this mode we are not supposed to "interact with the environment", so loading
all units and printing warnings about syntax errors and /var/run usage seems
inappropriate.
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
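For reference, a sketch of how an OOM kill can be detected from the cgroupv2
interface mentioned above; the path handling is simplified, and systemd gets
notified of changes rather than polling the file:

#include <stdio.h>
#include <string.h>

/* Read the "oom_kill" counter from a cgroup's memory.events file. A value
 * that increased since the last check means the kernel's OOM killer killed
 * a process in this cgroup. */
static int read_oom_kill(const char *cgroup_path, unsigned long long *ret) {
        char path[512], key[64];
        unsigned long long value;
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.events", cgroup_path);
        f = fopen(path, "re");
        if (!f)
                return -1;

        while (fscanf(f, "%63s %llu", key, &value) == 2)
                if (strcmp(key, "oom_kill") == 0) {
                        *ret = value;
                        fclose(f);
                        return 0;
                }

        fclose(f);
        return -1;
}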
So far the priorities for cgroup empty event handling were pretty weird.
The raw events (on cgroupsv2 from inotify, on cgroupsv1 from the agent
dgram socket) were scheduled at a lower priority than the cgroup empty
queue dispatcher. Let's swap that and ensure that we can coalesce events
more aggressively: let's process the raw events at higher priority than
the cgroup empty event (which remains at the same prio).
Let's unify the two similar code paths to watch /run/systemd/journal.
The code in manager.c is similar, but it uses mkdir_p_label(), and unifying
that would be too much trouble, so let's just adjust the error messages to
be the same.
CID #1400224.
Some PIDs can remain in the watched list even though their processes have
long since exited. This can easily happen, for example, if the main process of
a forking service manages to spawn a child before the control process exits.
However when a pid is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows if the pid should belong to this unit exclusively: if
we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take the opportunity to remove any stale PIDs from
the watched process list.
If we learnt about a PID in any other way (for example via a PID file, via
searching, MAINPID= and so on), then we can't assume anything.
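A toy illustration of the exclusive hint; watch_pid() here is a printf
stand-in for the real PID-watching call, which updates the manager's
PID-to-unit maps:

#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static void watch_pid(const char *unit, pid_t pid, bool exclusive) {
        printf("watch %s: pid=%d exclusive=%s\n",
               unit, (int) pid, exclusive ? "yes" : "no");
}

int main(void) {
        pid_t pid = fork();
        if (pid == 0)
                _exit(0);       /* child: nothing to do in this sketch */
        if (pid < 0)
                return 1;

        /* We forked this PID ourselves, so it cannot belong to any other
         * unit; the exclusive hint lets stale entries with the same PID be
         * evicted from the watch list. A PID learnt via a PID file, MAINPID=
         * or by searching gives no such guarantee. */
        watch_pid("foo.service", pid, true);
        return 0;
}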
When the system manager is started for the first time or after switching root,
udev's device tag data does not exist yet.
So, let's not honor the enumeration results.
Fixes #11997.
This adds a new bitfield to `execute_directories()` which allows configuring
whether to ignore non-zero exit statuses of the binaries run and
whether to allow parallel execution of commands.
In case errors are not ignored, the exit status of the failed script
will now be returned for error reporting purposes or other future use.
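The bitfield could look roughly like this (a sketch; the exact names in the
tree may differ):

typedef enum ExecDirFlags {
        EXEC_DIR_NONE           = 0,
        EXEC_DIR_PARALLEL       = 1 << 0,  /* run the binaries in parallel */
        EXEC_DIR_IGNORE_ERRORS  = 1 << 1,  /* a non-zero exit status doesn't fail the whole run */
} ExecDirFlags;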
Memory management is borked for this, and moreover this is unnecessary
since f0831ed2a0, i.e. since coldplug() and catchup() are two different
concepts: the former restoring the state from before a reload, the
latter then adjusting it again to the actual status in effect after the
reload.
Fixes: #10716
Mostly reverts: #8803
PATH_MAX is supposed to include the terminating NUL byte. But we already
check that there is no NUL byte in the specified path. Hence the maximum
length we can expect is PATH_MAX - 1.
This doesn't change much, but makes this use of PATH_MAX consistent with the
rest of the codebase.
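In other words, the check boils down to something like this (sketch, function
name illustrative):

#include <errno.h>
#include <limits.h>
#include <string.h>

/* PATH_MAX includes the terminating NUL, and embedded NUL bytes are already
 * rejected elsewhere, so the longest valid string is PATH_MAX - 1 chars. */
static int check_path_length(const char *p) {
        if (strlen(p) > PATH_MAX - 1)
                return -ENAMETOOLONG;
        return 0;
}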
add new "systemd-run-generator" for running arbitrary commands from the kernel command line as system services using the "systemd.run=" kernel command line switch
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate when
SuccessAction=exit or FailureAction=exit is used.
When not specified, let's also propagate the exit status of the main
process we fork off for the unit.
For PID 1 we adjust the umask to 0, but generators should not run that
way, given that they might be implemented as shell scripts and such.
Let's hence explicitly adjust the umask for them.
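A sketch of the idea, with 0022 as an illustrative value for the generator
umask (the value actually chosen may differ):

#include <sys/stat.h>
#include <unistd.h>

/* In the child forked off to run a generator, reset the umask before
 * exec'ing, so a (possibly script-based) generator doesn't inherit PID 1's
 * umask of 0. */
static void exec_generator(const char *path, char *const argv[]) {
        umask(0022);
        execv(path, argv);
        _exit(127);     /* only reached if execv() failed */
}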
We already do this for unit generators. Let's do this for env
generators, too.
Otherwise we keep collecting stuff from env generators, and we really
shouldn't.
This was working properly on reexec but not on reload, as for reexec we
would always start fresh, but for reload we would reuse the Manager object
and hence its default environment set.
Fixes: #10671