E.g. if you have a monthly event and you set the computer clock back one
year, we can allow the next 12 monthly events to happen naturally. In fact
we already do this when you start a Persistent=yes timer, we just need to
apply the same logic when it's running and we notice the system clock
being set backwards.
On timejumps, including suspend, timer_time_change() calls for a
re-calculation of the next elapse. Sadly I'm not quite sure what the
intended effect of this was! Because it was not managing to fire
OnCalendar= timers which fired during the suspend... unless the timer had
already fired once before.
Reported, entirely correctly as far as I can see, on stackexchange:
https://unix.stackexchange.com/questions/351829/systemd-timer-that-expired-while-suspended
/* If we know the last time this was
* triggered, schedule the job based relative
- * to that. If we don't just start from
- * now. */
+ * to that. If we don't, just start from
+ * the activation time. */
The same code is called for both the initial calculation and this
re-calculation. If we're _not_ already active, then this is before the
activation time has been recorded in the unit, so just use the current
time as before. The new code is mechanically adapted from the same
logic for `OnActiveSec=` (case TIMER_ACTIVE in the code which follows).
Tested with `date --set`.
Motivations:
* Rotate monitoring data from Atop into files which are named per-day.
Fedora currently implements this with a cron job that runs at midnight,
but that didn't handle suspend correctly either.
* unbound-anchor.timer on Fedora, is used to update DNSSEC "root trust
anchor" daily, before the TTL expires. It uses OnCalendar=daily
AccuracySec=24h. Which is a bit suspect because the TTL is 2 days, but I
think it has the right general idea.
None of the other timer settings are correct, because they would not
account for time spent in suspend. Unless you set WakeSystem
(this feature is currently undocumented).
* So in general, we can expect to see people using OnCalendar= for the same
cases as cron.daily and cron.monthly. Which use anacron to keep track of
jobs which should be run even if the system was down at the time.
Timers which are configured to run more frequently than that, are
unlikely to mind if they get run slightly more often that the writer
realized, relative to the amount of time the system was really running.
* From the user report above: "I only want to use remind to show a desktop
notification, it seems excessive to wake up the computer for that. Also,
I would like to get the reminder first thing in the morning, so the
OnActiveSec doesn't help with that."
We have two variables `b` and `base`. `b` is declared within limited
scope; `base` is declared at the top of the function. However `base`
is actually only used within a scope which is exclusive of `b`. Clarify
by moving `base` inside the limited scope as well.
(Also `base` doesn't need initializing any more than `b` does. The
declaration of `base` is now immediately followed by a case analysis of
`v->base`, which serves almost exclusively to determine the value of
`base`).
This reworks system call filter parsing, and replaces a couple of "bool"
function arguments by a single flags parameter.
This shouldn't change behaviour, except for one case: when we
recursively call our parsing function on our own syscall list, then
we'll lower the log level to LOG_DEBUG from LOG_WARNING, because at that
point things are just a problem in our own code rather than in the user
configuration we are parsing, and we shouldn't hence generate confusing
warnings about syntax errors.
Fixes: #8261
We maintain a queue of units and jobs that we are supposed to generate
change/new notifications for because they were either just created or
some of their property has changed. Let's throttle processing of this
queue a bit: as soon as > 1K of bus messages are queued for writing
let's skip processing the queue, and then recheck on the next
iteration again.
Moreover, never process more than 100 units in one go, return to the
event loop after that. Both limits together should put effective limits
on both space and time usage of the function, delaying further
operations until a later moment, when the queue is empty or the the
event loop is sufficiently idle again.
This should keep the number of generated messages much lower than
before on busy systems or where some client is hanging.
Note that this also means a bad client can slow down message dispatching
substantially for up to 90s if it likes to, for all clients. But that
should be acceptable as we only allow trusted bus clients, anyway.
Fixes: #8166
This is an optimization: there's no point in enqueuing unit and job
change notificiation signal messages into bus connection that aren't
fully set up yet.
This doesn't fix#8166 but should lower the load of messages enqueued
but not processed yet a bit.
There isn't much difference, but in general we prefer to use the standard
functions. glibc provides reallocarray since version 2.26.
I moved explicit_bzero is configure test to the bottom, so that the two stdlib
functions are at the bottom.
This partially reverts 3536f49e8f and
3536f49e8f.
When the user is dynamic, and we are setting up state, cache, or logs dirs,
behaviour is unchanged, we always do a recursive chown. This is necessary
because the user number might change between invocations.
But when setting up a directory for non-dynamic user, or a runtime directory
for a dynamic user, do any ownership or mode changes only when the directory
is initially created. Nothing says that the files under those directories have
to be all recursively owned by our user. This restores behaviour before
3536f49e8f, so modifications to the state of
the runtime directory persist between ExecStartPre's and ExecStart's, and even
longer in case the directory is persistent.
I think it _would_ be a nice property if setting a user would automatically
propagate to ownership of any Runtime/Logs/Cache directories. But this is
incompatible with another nice property, namely preserving changes to those
directories made by an admin, and with allowing change of ownership of files
in those directories by the service (e.g. to allow other users to access them).
Of the two, I think the second property is more important. Also, it's backwards
compatible.
https://bugzilla.redhat.com/show_bug.cgi?id=1508495
There is no need to chmod a directory we just created, so move that step
up into a branch. After that, 'effective' is only used once, so get rid of
it too.
Generally we prefer 'return' from main() over exit() so that automatic
cleanups and such work correct. Let's do that in shutdown.c too, becuase
there's not really any reason not to.
With this we are pretty good in consistently using return from main()
rather than exit() all across the codebase. Yay!
So far, we had two implementations of reboot-with-parameter doing pretty
much the same. Let's unify that in a generic implementation used by
both.
This is particulary nice as it unifies all /run/systemd/reboot-param
handling in a single .c file.
This is primarily preparation for a follow-up commit that adds a common
implementation of the other side of the reboot parameter file, i.e. the
code that reads the file and issues reboot() for it.
This mimics the raw_clone() call we have in place already and
establishes a new syscall wrapper raw_reboot() that wraps the kernel's
reboot() system call in a bit more low-level fashion that glibc's
reboot() wrapper. The main difference is that the extra "arg" argument
is supported.
Ultimately this just replaces the syscall wrapper implementation we
currently have at three places in our codebase by a single one.
With this change this means that all our syscall() invocations are
neatly separated out in static inline system call wrappers in our header
functions.
In a number of occasions we use FORK_CLOSE_ALL_FDS when forking off a
child, since we don't want to pass fds to the processes spawned (either
because we later want to execve() some other process there, or because
our child might hang around for longer than expected, in which case it
shouldn't keep our fd pinned). This also closes any logging fds, and
thus means logging is turned off in the child. If we want to do proper
logging, explicitly reopen the logs hence in the child at the right
time.
This is particularly crucial in the umount/remount children we fork off
the shutdown binary, as otherwise the children can't log, which is
why #8155 is harder to debug than necessary: the log messages we
generate about failing mount() system calls aren't actually visible on
screen, as they done in the child processes where the log fds are
closed.
We used to set this, but this was dropped when shutdown got taught to
get the target passed in from the regular PID 1. Let's readd this to
make things more explanatory, and cover all grounds, since after all the
target passed is in theory an optional part of the protocol between the
regular PID 1 and the shutdown PID 1.
We maintain an "extra" set of IP accounting counters that are used when
we systemd is reloaded to carry over the counters from the previous run.
Let's reset these to zero whenever IP accounting is turned off. If we
don't do this then turning off IP accounting and back on later wouldn't
reset the counters, which is quite surprising and different from how our
CPU time counting works.
So, the kernel's management of cgroup/BPF programs is a bit misdesigned:
if you attach a BPF program to a cgroup and close the fd for it it will
stay pinned to the cgroup with no chance of ever removing it again (or
otherwise getting ahold of it again), because the fd is used for
selecting which BPF program to detach. The only way to get rid of the
program again is to destroy the cgroup itself.
This is particularly bad for root the cgroup (and in fact any other
cgroup that we cannot realistically remove during runtime, such as
/system.slice, /init.scope or /system.slice/dbus.service) as getting rid
of the program only works by rebooting the system.
To counter this let's closely keep track to which cgroup a BPF program
is attached and let's implicitly detach the BPF program when we are
about to close the BPF fd.
This hence changes the bpf_program_cgroup_attach() function to track
where we attached the program and changes bpf_program_cgroup_detach() to
use this information. Moreover bpf_program_unref() will now implicitly
call bpf_program_cgroup_detach().
In order to simplify things, bpf_program_cgroup_attach() will now
implicitly invoke bpf_program_load_kernel() when necessary, simplifying
the caller's side.
Finally, this adds proper reference counting to BPF programs. This
is useful for working with two BPF programs in parallel: the BPF program
we are preparing for installation and the BPF program we so far
installed, shortening the window when we detach the old one and reattach
the new one.
So far, for all our API VFS mounts we used the fstype also as mount
source, let's do that for the cgroupsv2 mounts too. The kernel doesn't
really care about the source for API VFS, but it's visible to the user,
hence let's clean this up and follow the rule we otherwise follow.
This new kernel 4.15 flag permits that multiple BPF programs can be
executed for each packet processed: multiple per cgroup plus all
programs defined up the tree on all parent cgroups.
We can use this for two features:
1. Finally provide per-slice IP accounting (which was previously
unavailable)
2. Permit delegation of BPF programs to services (i.e. leaf nodes).
This patch beefs up PID1's handling of BPF to enable both.
Note two special items to keep in mind:
a. Our inner-node BPF programs (i.e. the ones we attach to slices) do
not enforce IP access lists, that's done exclsuively in the leaf-node
BPF programs. That's a good thing, since that way rules in leaf nodes
can cancel out rules further up (i.e. for example to implement a
logic of "disallow everything except httpd.service"). Inner node BPF
programs to accounting however if that's requested. This is
beneficial for performance reasons: it means in order to provide
per-slice IP accounting we don't have to add up all child unit's
data.
b. When this code is run on pre-4.15 kernel (i.e. where
BPF_F_ALLOW_MULTI is not available) we'll make IP acocunting on slice
units unavailable (i.e. revert to behaviour from before this commit).
For leaf nodes we'll fallback to non-ALLOW_MULTI mode however, which
means that BPF delegation is not available there at all, if IP
fw/acct is turned on for the unit. This is a change from earlier
behaviour, where we use the BPF_F_ALLOW_OVERRIDE flag, so that our
fw/acct would lose its effect as soon as delegation was turned on and
some client made use of that. I think the new behaviour is the safer
choice in this case, as silent bypassing of our fw rules is not
possible anymore. And if people want proper delegation then the way
out is a more modern kernel or turning off IP firewalling/acct for
the unit algother.
We make heavy use of BPF functionality these days, hence expose the BPF
file system too by default now. (Note however, that we don't actually
make use bpf file systems object yet, but we might later on too.)
This improves the BPF/cgroup detection logic, and looks whether
BPF_ALLOW_MULTI is supported. This flag allows execution of multiple
BPF filters in a recursive fashion for a whole cgroup tree. It enables
us to properly report IP accounting for slice units, as well as
delegation of BPF support to units without breaking our own IP
accounting.
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.