Let's make sure we don't clobber the return parameter on failure, to
follow our coding style. Also, break the loop early if we have all
attributes we need.
This also changes the keys parameter to a simple char**, so that we can
use STRV_MAKE() for passing the list of attributes to read.
This also makes it possible to distuingish the case when the whole
attribute file doesn't exist from one key in it missing. In the former
case we return -ENOENT, in the latter we now return -ENXIO.
These helper calls are potentially called often, and allocate FILE*
objects internally for a very short period of time, let's turn off
locking for them too.
We never use these functions seperately, hence don't bother splitting
them into to.
Also, simplify things a bit, and maintain tables for the attribute files
to chown. Let's also update those tables a bit, and include thenew
"cgroup.threads" file in it, that needs to be delegated too, according
to the documentation.
When a process becomes a zombie its cgroup might be deleted. Let's add
some minimal code to detect cases like this, so that we can still
attribute this back to the original cgroup.
Already, path_is_safe() refused paths container the "." dir. Doing that
isn't strictly necessary to be "safe" by most definitions of the word.
But it is necessary in order to consider a path "normalized". Hence,
"path_is_safe()" is slightly misleading a name, but
"path_is_normalize()" is more descriptive, hence let's rename things
accordingly.
No functional changes.
On cgroupsv2 we should also chown()/chmod() the subtree_control file,
so that children can use controllers the way they like.
On cgroupsv1 we should also chown()/chmod() cgroups.clone_children, as
not setting this for new cgroups makes little sense, and hence delegated
clients should be able to write to it.
Note that error handling for both cases is different. subtree_control
matters so we check for errors, but the clone_children/tasks stuff
doesn't really, as it's legacy stuff. Hence we only log errors and
proceed.
Fixes: #6216
These errors don't really matter, that's why we log and proceed in the
current code. However, we currently log at LOG_WARNING, but we really
shouldn't given that this is library code. Hence downgrade this to
LOG_DEBUG.
This moves pretty much all uses of getpid() over to getpid_raw(). I
didn't specifically check whether the optimization is worth it for each
replacement, but in order to keep things simple and systematic I
switched over everything at once.
Let's just check the unified level, directly. There's really no value in
wrapping cg_unified_controllers() with this, i.e. potentially do string
comparison when there's no reason to.
Also, this makes the clal more alike cg_hybrid_unified().
We use our cgroup APIs in various contexts, including from our libraries
sd-login, sd-bus. As we don#t control those environments we can't rely
that the unified cgroup setup logic succeeds, and hence really shouldn't
assert on it.
This more or less reverts 415fc41cea.
If we encounter an error in proc cmdline parsing, just treat that as permanent,
i.e. the same as if the option was not specified. Realistically, it is better
to use the same condition for all related mounts, then to have e.g.
/sys/fs/cgroup mounted and /sys/fs/cgroup/unified not. If we find something is
mounted and base our answer on that, cache that result too.
Fix the conditions so that if "unified" is used, make sure any "hybrid" mounts
are not mounted.
We need this to gracefully support older or strangely configured kernels.
v2:
- do not install a callback handler, just embed the right conditions into
cg_is_*_wanted()
v3:
- fix bug in cg_is_legacy_wanted()
The default default is set to "legacy", with "hybrid" and "unified"
being the other two alternatives.
There invert the behaviour for systemd.legacy_systemd_cgroup_controller:
if it is not specified on the kernel command line, "hybrid" is used if
selected as the default. If this option is specified, "hybrid" is used if false,
and full "legacy" if true.
Also make all fields in the configure summary lowercase (unless they are
capitalized names) for consistency.
v2:
- update for the fixed interpreation of systemd.legacy_systemd_cgroup_controller
v232's cgroup hybrid mode mounted v2 on /sys/fs/cgroup/systemd, which
unfortunately broke other tools which expect v1 there. From v233 on, hybrid
mode instead mounts and uses v2 on /sys/fs/cgroup/unified and keeps
/sys/fs/cgroup/systemd on v1 for compatibility with external tools. However,
to keep systemd live upgrades working, v233+ should be able to recognize v232
layout and keep using it.
This patch adds v232 hybrid mode support. If v232 layout is detected,
cg_unified(SYSTEMD_CGRouP_CONTROLLER) keeps returning %true but
cg_hybrid_unified() returns %false. This keeps process management on cgroup v2
but turns off the parallel layout.
Currently the hybrid mode mounts cgroup v2 on /sys/fs/cgroup instead of the v1
name=systemd hierarchy. While this works fine for systemd itself, it breaks
tools which expect cgroup v1 hierarchy on /sys/fs/cgroup/systemd.
This patch updates the hybrid mode so that it mounts v2 hierarchy on
/sys/fs/cgroup/unified and keeps v1 "name=systemd" hierarchy on
/sys/fs/cgroup/systemd for compatibility. systemd itself doesn't depend on the
"name=systemd" hierarchy at all. All operations take place on the v2 hierarchy
as before but the v1 hierarchy is kept in sync so that any tools which expect
it to be there can keep doing so. This allows systemd to take advantage of
cgroup v2 process management without requiring other tools to be aware of the
hybrid mode.
The hybrid mode is implemented by mapping the special systemd controller to
/sys/fs/cgroup/unified and making the basic cgroup utility operations -
cg_attach(), cg_create(), cg_rmdir() and cg_trim() - also operate on the
/sys/fs/cgroup/systemd hierarchy whenever the cgroup2 hierarchy is updated.
While a bit messy, this will allow dropping complications from using cgroup v1
for process management a lot sooner than otherwise possible which should make
it a net gain in terms of maintainability.
v2: Fixed !cgns breakage reported by @evverx and renamed the unified mount
point to /sys/fs/cgroup/unified as suggested by @brauner.
v3: chown the compat hierarchy too on delegation. Suggested by @evverx.
v4: [zj]
- drop the change to default, full "legacy" is still the default.
1d84ad9445 reversed the meaning of the option.
The kernel command line option has the opposite meaning to the function,
i.e. specifying "legacy=yes" means "unifed systemd controller=no".
SYSTEMD_CGROUP_CONTROLLER is currently defined as "name=systemd" which cgroup
utility functions interpret as a named cgroup hierarchy with the specified
named. With the planned cgroup hybrid mode changes, SYSTEMD_CGROUP_CONTROLLER
would map to different hierarchy names.
This patch makes SYSTEMD_CGROUP_CONTROLLER a special string "_systemd" which is
substituted to "name=systemd" by the cgroup utility functions. This allows the
callers to address the systemd hierarchy without actually specifying the
hierarchy name allowing the cgroup utility functions to map it to whatever is
appropriate.
Note that SYSTEMD_CGROUP_CONTROLLER was already special on full unified cgroup
hierarchy even before this patch.
cg_[all_]unified() test whether a specific controller or all controllers are on
the unified hierarchy. While what's being asked is a simple binary question,
the callers must assume that the functions may fail any time, which
unnecessarily complicates their usages. This complication is unnecessary.
Internally, the test result is cached anyway and there are only a few places
where the test actually needs to be performed.
This patch simplifies cg_[all_]unified().
* cg_[all_]unified() are updated to return bool. If the result can't be
decided, assertion failure is triggered. Error handlings from their callers
are dropped.
* cg_unified_flush() is updated to calculate the new result synchrnously and
return whether it succeeded or not. Places which need to flush the test
result are updated to test for failure. This ensures that all the following
cg_[all_]unified() tests succeed.
* Places which expected possible cg_[all_]unified() failures are updated to
call and test cg_unified_flush() before calling cg_[all_]unified(). This
includes functions used while setting up mounts during boot and
manager_setup_cgroup().
This improves kernel command line parsing in a number of ways:
a) An kernel option "foo_bar=xyz" is now considered equivalent to
"foo-bar-xyz", i.e. when comparing kernel command line option names "-" and
"_" are now considered equivalent (this only applies to the option names
though, not the option values!). Most of our kernel options used "-" as word
separator in kernel command line options so far, but some used "_". With
this change, which was a source of confusion for users (well, at least of
one user: myself, I just couldn't remember that it's systemd.debug-shell,
not systemd.debug_shell). Considering both as equivalent is inspired how
modern kernel module loading normalizes all kernel module names to use
underscores now too.
b) All options previously using a dash for separating words in kernel command
line options now use an underscore instead, in all documentation and in
code. Since a) has been implemented this should not create any compatibility
problems, but normalizes our documentation and our code.
c) All kernel command line options which take booleans (or are boolean-like)
have been reworked so that "foobar" (without argument) is now equivalent to
"foobar=1" (but not "foobar=0"), thus normalizing the handling of our
boolean arguments. Specifically this means systemd.debug-shell and
systemd_debug_shell=1 are now entirely equivalent.
d) All kernel command line options which take an argument, and where no
argument is specified will now result in a log message. e.g. passing just
"systemd.unit" will no result in a complain that it needs an argument. This
is implemented in the proc_cmdline_missing_value() function.
e) There's now a call proc_cmdline_get_bool() similar to proc_cmdline_get_key()
that parses booleans (following the logic explained in c).
f) The proc_cmdline_parse() call's boolean argument has been replaced by a new
flags argument that takes a common set of bits with proc_cmdline_get_key().
g) All kernel command line APIs now begin with the same "proc_cmdline_" prefix.
h) There are now tests for much of this. Yay!
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.