These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
Previously the enumerate() callback defined for each unit type would do
two things:
1. It would create perpetual units (i.e. -.slice, system.slice, -.mount and
init.scope)
2. It would enumerate units from /proc/self/mountinfo, /proc/swaps and
the udev database
With this change these two parts are split into two seperate methods:
enumerate() now only does #2, while enumerate_perpetual() is responsible
for #1. Why make this change? Well, perpetual units should have a
slightly different effect that those found through enumeration: as
perpetual units should be up unconditionally, perpetually and thus never
change state, they should also not pull in deps by their state changing,
not even when the state is first set to active. Thus, their state is
generally initialized through the per-device coldplug() method in
similar fashion to the deserialized state from a previous run would be
put into place. OTOH units found through regular enumeration should
result in state changes (and thus pull in deps due to state changes),
hence their state should be put in effect in the catchup() method
instead. Hence, given this difference, let's also separate the
functions, so that the rule is:
1. What is created in enumerate_perpetual() should be started in
coldplug()
2. What is created in enumerate() should be started in catchup().
Scope units don't have a main or control process we can watch, hence
let's explicitly watch the PIDs contained in them early on, just to make
things more robust and have at least something to watch.
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.
This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.
Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.
All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.
Fixes: #9138
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.
Note that this adjusts behaviour of systemd to match what the docs say.
Fixes: #8398
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.
The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.
The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.
The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have now User= field).
Fixes: #3388
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.
Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.
Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
`IgnoreOnIsolate=yes` is the default for slices and scopes. So it's not
essential to set it on root.slice or init.scope.
We don't need to worry about a bad unit file configuration. Any attempt
to stop these unit should fail, since we mark them as `perpetual`.
Also since init.scope cannot be stopped, there is no point setting
`KillSignal=SIGRTMIN+14`. According to both documentation and testing,
KillSignal= does not affect the behaviour of `systemctl kill`.
This code is very similar in scope and service units, let's unify it in
one function. This changes little for service units, but for scope units
makes sure we go through the cgroup queue, which is something we should
do anyway.
Let's move the cgroup empty check for all unit types into the generic
unit_check_gc() call, out of the per-unit-type _check_gc() type. This
not only allows us to share some code, but also hooks up mount and
socket units with this kind of check, for free, as it was missing there
previously.
This watches controllers on the bus, and unsets them automatically when
they disappear.
Note that this is primarily a cosmetical fix. Since unique bus names are not
recycled, there's strictly no need to forget about them, but it's a lot
nicer to do so.
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089Fixes: #3041Fixes: #4441
This replaces the dependencies Set* objects by Hashmap* objects, where
the key is the depending Unit, and the value is a bitmask encoding why
the specific dependency was created.
The bitmask contains a number of different, defined bits, that indicate
why dependencies exist, for example whether they are created due to
explicitly configured deps in files, by udev rules or implicitly.
Note that memory usage is not increased by this change, even though we
store more information, as we manage to encode the bit mask inside the
value pointer each Hashmap entry contains.
Why this all? When we know how a dependency came to be, we can update
dependencies correctly when a configuration source changes but others
are left unaltered. Specifically:
1. We can fix UDEV_WANTS dependency generation: so far we kept adding
dependencies configured that way, but if a device lost such a
dependency we couldn't them again as there was no scheme for removing
of dependencies in place.
2. We can implement "pin-pointed" reload of unit files. If we know what
dependencies were created as result of configuration in a unit file,
then we know what to flush out when we want to reload it.
3. It's useful for debugging: "systemd-analyze dump" now shows
this information, helping substantially with understanding how
systemd's dependency tree came to be the way it came to be.
scope_abandon() called unit_watch_all_pids() to check if there are any pids in
the cgroup, but unit_watch_all_pids() does nothing (and returns an empty set of
pids) when cg_unified_controller(SYSTEMD) returns true. On hybrid or unified,
cg_unified_controller(SYSTEMD) returns 1, so scope_abandon() thinks the scope
is empty, even though it's not, and marks the scope dead prematurely.
Example output after the scope is marked dead with processes still being present:
● session-24.scope - Session 24 of user guest
Loaded: loaded (/run/systemd/transient/session-24.scope; transient; vendor preset: disabled)
Transient: yes
Active: inactive (dead) since Sat 2017-10-14 15:36:22 CEST; 5min ago
Tasks: 1
CGroup: /user.slice/user-1001.slice/session-24.scope
└─17309 sleep infinity
Subsequent calls to stop the scope unit do nothing, because systemd thinks the
scope is already dead.
https://bugzilla.redhat.com/show_bug.cgi?id=1486859
This is easily reproducible on both F26 and F27:
> ssh guest@localhost
$ pulseaudio & sleep infinity & disown; exit
$ systemctl status $$
> systemctl status session-NN.scope
Tested with unified, hybrid, and legacy layouts, seems to work.
This slightly changes how we log about failures. Previously,
service_enter_dead() would log that a service unit failed along with its
result code, and unit_notify() would do this again but without the
result code. For other unit types only the latter would take effect.
This cleans this up: we keep the message in unit_notify() only for debug
purposes, and add type-specific log lines to all our unit types that can
fail, and always place them before unit_notify() is invoked.
Or in other words: the duplicate log message for service units is
removed, and all other unit types get a more useful line with the
precise result code.
We use our cgroup APIs in various contexts, including from our libraries
sd-login, sd-bus. As we don#t control those environments we can't rely
that the unified cgroup setup logic succeeds, and hence really shouldn't
assert on it.
This more or less reverts 415fc41cea.
cg_[all_]unified() test whether a specific controller or all controllers are on
the unified hierarchy. While what's being asked is a simple binary question,
the callers must assume that the functions may fail any time, which
unnecessarily complicates their usages. This complication is unnecessary.
Internally, the test result is cached anyway and there are only a few places
where the test actually needs to be performed.
This patch simplifies cg_[all_]unified().
* cg_[all_]unified() are updated to return bool. If the result can't be
decided, assertion failure is triggered. Error handlings from their callers
are dropped.
* cg_unified_flush() is updated to calculate the new result synchrnously and
return whether it succeeded or not. Places which need to flush the test
result are updated to test for failure. This ensures that all the following
cg_[all_]unified() tests succeed.
* Places which expected possible cg_[all_]unified() failures are updated to
call and test cg_unified_flush() before calling cg_[all_]unified(). This
includes functions used while setting up mounts during boot and
manager_setup_cgroup().
So far "no_gc" was set on -.slice and init.scope, to units that are always
running, cannot be stopped and never exist in an "inactive" state. Since these
units are the only users of this flag, let's remodel it and rename it
"perpetual" and let's derive more funcitonality off it. Specifically, refuse
enqueing stop jobs for these units, and report that they are "unstoppable" in
the CanStop bus property.
Previously, we'd synthesize the root slice unit and the init scope unit in the
enumerator callbacks for the unit type. This is problematic if either of them
is already referenced from a unit that is loaded as result of another unit
type's enumerator logic.
Let's clean this up and simply create the two objects from the enumerator
callbacks, if they are not around yet. Do the actual filling in of the settings
from the unit_load() callbacks, to match how other units are loaded.
Fixes: #4322
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management. Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.
Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies. It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logics and would allow
using features which depend on the unified hierarchy membership such bpf cgroup
v2 membership test. In time, this would also allow deleting the said
complexities.
This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.
* cg_unified(@controller) is introduced which tests whether the specific
controller in on unified hierarchy and used to choose the unified hierarchy
code path for process and service management when available. Kernel
controller specific operations remain gated by cg_all_unified().
* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
force the use of legacy hierarchy for systemd cgroup controller.
* nspawn: By default nspawn uses the same hierarchies as the host. If
UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If
0, legacy for all.
* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
three options - legacy, only systemd controller on unified, and unified. The
value is passed into mount setup functions and controls cgroup configuration.
* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
appropriate action depending on the configuration of the host.
v2: - CGroupUnified enum replaces open coded integer values to indicate the
cgroup operation mode.
- Various style updates.
v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.
v4: Restored legacy container on unified host support and fixed another bug in
detect_unified_cgroup_hierarchy().
A following patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies. This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is. In preparation, this patch renames cg_unified() to
cg_all_unified().
This patch doesn't cause any functional changes.
Previously, the result value of a unit was overriden with each failure that
took place, so that the result always reported the last failure that took
place.
With this commit this is changed, so that the first failure taking place is
stored instead. This should normally not matter much as multiple failures are
sufficiently uncommon. However, it improves one behaviour: if we send SIGABRT
to a service due to a watchdog timeout, then this currently would be reported
as "coredump" failure, rather than the "watchodg" failure it really is. Hence,
in order to report information about the type of the failure, and not about
the effect of it, let's change this from all unit type to store the first, not
the last failure.
This addresses the issue pointed out here:
https://github.com/systemd/systemd/pull/3818#discussion_r73433520
Let's lot at LOG_NOTICE about any processes that we are going to
SIGKILL/SIGABRT because clean termination of them didn't work.
This turns the various boolean flag parameters to cg_kill(), cg_migrate() and
related calls into a single binary flags parameter, simply because the function
now gained even more parameters and the parameter listed shouldn't get too
long.
Logging for killing processes is done either when the kill signal is SIGABRT or
SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE
is passed. This isn't used yet in this patch, but is made use of in a later
patch.
With this change the logic for placing transient unit files and drop-ins
generated via "systemctl set-property" is reworked.
The latter are now placed in the newly introduced "control" unit file
directory. The fomer are now placed in the "transient" unit file directory.
Note that the properties originally set when a transient unit was created will
be written to and stay in the transient unit file directory, while later
changes are done via drop-ins.
This is preparation for a later "systemctl revert" addition, where existing
drop-ins are flushed out, but the original transient definition is restored.
This replaces the old function call manager_is_reloading_or_reexecuting() which
was used only at very few places. Use the new macro wherever we check whether
we are reloading. This should hopefully make things a bit more readable, given
the nature of Manager:n_reloading being a counter.
This clean-ups timeout handling in PID 1. Specifically, instead of storing 0 in internal timeout variables as
indication for a disabled timeout, use USEC_INFINITY which is in-line with how we do this in the rest of our code
(following the logic that 0 means "no", and USEC_INFINITY means "never").
This also replace all usec_t additions with invocations to usec_add(), so that USEC_INFINITY is properly propagated,
and sd-event considers it has indication for turning off the event source.
This also alters the deserialization of the units to restart timeouts from the time they were originally started from.
Before this patch timeouts would be restarted beginning with the time of the deserialization, which could lead to
artificially prolonged timeouts if a daemon reload took place.
Finally, a new RuntimeMaxSec= setting is introduced for service units, that specifies a maximum runtime after which a
specific service is forcibly terminated. This is useful to put time limits on time-intensive processing jobs.
This also simplifies the various xyz_spawn() calls of the various types in that explicit distruction of the timers is
removed, as that is done anyway by the state change handlers, and a state change is always done when the xyz_spawn()
calls fail.
Fixes: #2249
It's nicer to hide the check away in the various
xyz_add_default_dependencies() calls, rather than making it explicit in
the caller, and thus require deeper nesing.
Snapshots were never useful or used for anything. Many systemd
developers that I spoke to at systemd.conf2015, didn't even know they
existed, so it is fairly safe to assume that this type can be deleted
without harm.
The fundamental problem with snapshots is that the state of the system
is dynamic, devices come and go, users log in and out, timers fire...
and restoring all units to some state from the past would "undo"
those changes, which isn't really possible.
Tested by creating a snapshot, running the new binary, and checking
that the transition did not cause errors, and the snapshot is gone,
and snapshots cannot be created anymore.
New systemctl says:
Unknown operation snapshot.
Old systemctl says:
Failed to create snapshot: Support for snapshots has been removed.
IgnoreOnSnaphost settings are warned about and ignored:
Support for option IgnoreOnSnapshot= has been removed and it is ignored
http://lists.freedesktop.org/archives/systemd-devel/2015-November/034872.html