Already, path_is_safe() refused paths container the "." dir. Doing that
isn't strictly necessary to be "safe" by most definitions of the word.
But it is necessary in order to consider a path "normalized". Hence,
"path_is_safe()" is slightly misleading a name, but
"path_is_normalize()" is more descriptive, hence let's rename things
accordingly.
No functional changes.
Right now, the option only takes one of two possible values "inactive"
or "inactive-or-failed", the former being the default, and exposing same
behaviour as the status quo ante. If set to "inactive-or-failed" units
may be collected by the GC logic when in the "failed" state too.
This logic should be a nicer alternative to using the "-" modifier for
ExecStart= and friends, as the exit data is collected and logged about
and only removed when the GC comes along. This should be useful in
particular for per-connection socket-activated services, as well as
"systemd-run" command lines that shall leave no artifacts in the
system.
I was thinking about whether to expose this as a boolean, but opted for
an enum instead, as I have the suspicion other tweaks like this might be
a added later on, in which case we extend this setting instead of having
to add yet another one.
Also, let's add some documentation for the GC logic.
When preparing for a restart we quickly go through the DEAD/INACTIVE
service state before entering AUTO_RESTART. When doing this, we need to
make sure we don't destroy the FD store. Previously this was done by
checking the failure state of the unit, and keeping the FD store around
when the unit failed, under the assumption that the restart logic will
then get into action.
This is not entirely correct howver, as there might be failure states
that will no result in restarts.
With this commit we slightly alter the logic: a ref counter for the fd
store is added, that is increased right before we handle the restart
logic, and decreased again right-after.
This should ensure that the fdstore lives exactly as long as it needs.
Follow-up for f0bfbfac43.
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089Fixes: #3041Fixes: #4441
Previously it was not possible to select which controllers to enable for
a unit where Delegate=yes was set, as all controllers were enabled. With
this change, this is made configurable, and thus delegation units can
pick specifically what they want to manage themselves, and what they
don't care about.
This replaces the dependencies Set* objects by Hashmap* objects, where
the key is the depending Unit, and the value is a bitmask encoding why
the specific dependency was created.
The bitmask contains a number of different, defined bits, that indicate
why dependencies exist, for example whether they are created due to
explicitly configured deps in files, by udev rules or implicitly.
Note that memory usage is not increased by this change, even though we
store more information, as we manage to encode the bit mask inside the
value pointer each Hashmap entry contains.
Why this all? When we know how a dependency came to be, we can update
dependencies correctly when a configuration source changes but others
are left unaltered. Specifically:
1. We can fix UDEV_WANTS dependency generation: so far we kept adding
dependencies configured that way, but if a device lost such a
dependency we couldn't them again as there was no scheme for removing
of dependencies in place.
2. We can implement "pin-pointed" reload of unit files. If we know what
dependencies were created as result of configuration in a unit file,
then we know what to flush out when we want to reload it.
3. It's useful for debugging: "systemd-analyze dump" now shows
this information, helping substantially with understanding how
systemd's dependency tree came to be the way it came to be.
The units that are not loaded don't have dropin_paths set. This
currently results in units that have fragments to always have
NeedDaemonReload=true when masked:
$ find {/usr/lib,/run/user/8086}/systemd/user/meh.service* |xargs ls -ld
lrwxrwxrwx. 1 lkundrak lkundrak 9 Oct 11 11:19 /run/user/8086/systemd/user/meh.service -> /dev/null
-rw-rw-r--. 1 root root 49 Oct 11 10:16 /usr/lib/systemd/user/meh.service
drwxrwxr-x. 2 root root 4096 Oct 11 10:50 /usr/lib/systemd/user/meh.service.d
-rw-rw-r--. 1 root root 666 Oct 11 10:50 /usr/lib/systemd/user/meh.service.d/override.conf
$ systemctl --user daemon-reload
$ busctl --user get-property org.freedesktop.systemd1 \
/org/freedesktop/systemd1/unit/meh_2eservice \
org.freedesktop.systemd1.Unit NeedDaemonReload
b true
In most cases we followed the rule that the special _INVALID and _MAX
values we use in our enums use the full type name as prefix (in contrast
to regular values that we often make shorter), do so for
ExecDirectoryType as well.
No functional changes, just a little bit of renaming to make this code
more like the rest.
This slightly changes how we log about failures. Previously,
service_enter_dead() would log that a service unit failed along with its
result code, and unit_notify() would do this again but without the
result code. For other unit types only the latter would take effect.
This cleans this up: we keep the message in unit_notify() only for debug
purposes, and add type-specific log lines to all our unit types that can
fail, and always place them before unit_notify() is invoked.
Or in other words: the duplicate log message for service units is
removed, and all other unit types get a more useful line with the
precise result code.
This makes sure that if we learn via inotify or another event source
that a cgroup is empty, and we checked that this is indeed the case (as
we might get spurious notifications through inotify, as the inotify
logic through the "cgroups.event" is pretty unspecific and might be
trigger for a variety of reasons), then we'll enqueue a defer event for
it, at a priority lower than SIGCHLD handling, so that we know for sure
that if there's waitid() data for a process we used it before
considering the cgroup empty notification.
Fixes: #6608
We are about to add second cgroup-related queue, called
"cgroup_empty_queue", hence let's rename "cgroup_queue" to
"cgroup_realize_queue" (as that is its purpose) to minimize confusion
about the two queues.
Just a rename, no functional changes.
This adds a new recognizable log message for each unit invocation that
contains structured information about consumed resources of the unit as
a whole after it terminated. This is particular useful for apps that
want to figure out what the resource consumption of a unit given a
specific invocation ID was.
The log message is only generated for units that have at least one
XyzAccounting= property turned on, and currently only covers IP traffic and CPU
time metrics.
With this change we'll invalidate all cgroup settings after coming back
from a daemon reload/reexec, so that the new settings are instantly
applied.
This is useful for the BPF case, because we don't serialize/deserialize
the BPF program fd, and hence have to install a new, updated BPF program
when coming back from the reload/reexec. However, this is also useful
for the rest of the cgroup settings, as it ensures that user
configuration really takes effect wherever we can.
Make sure the current IP accounting counters aren't lost during
reload/reexec.
Note that we destroy all BPF file objects during a reload: the BPF
programs, the access and the accounting maps. The former two need to be
regenerated anyway with the newly loaded configuration data, but the
latter one needs to survive reloads/reexec. In this implementation I
opted to only save/restore the accounting map content instead of the map
itself. While this opens a (theoretic) window where IP traffic is still
accounted to the old map after we read it out, and we thus miss a few
bytes this has the benefit that we can alter the map layout between
versions should the need arise.
Let's make sure that a socket unit's IPAddressAllow=/IPAddressDeny=
settings are in effect on all socket fds associated with it. In order to
make this happen we need to make sure the cgroup the fds are associated
with are the socket unit's cgroup. The only way to do that is invoking
socket()+accept() in them. Since we really don't want to migrate PID 1
around we do this by forking off a helper process, which invokes
socket()/accept() and sends the newly created fd to PID 1. Ugly, but
works, and there's apparently no better way right now.
This generalizes forking off per-unit helper processes in a new function
unit_fork_helper_process(), which is then also used by the NSS chown()
code of socket units.
Usually, it's a good thing that we isolate the kernel session keyring
for the various services and disconnect them from the user keyring.
However, in case of the cryptsetup key caching we actually want that
multiple instances of the cryptsetup service can share the keys in the
root user's user keyring, hence we need to be able to disable this logic
for them.
This adds KeyringMode=inherit|private|shared:
inherit: don't do any keyring magic (this is the default in systemd --user)
private: a private keyring as before (default in systemd --system)
shared: the new setting
The new unit_set_exec_params() call is to units what
manager_set_exec_params() is to the manager object: it initializes the
various fields from the relevant generic properties set.
We just abort startup, without printing any error. Make sure we always
print something, and when we cannot deserialize some unit, just ignore it and
continue.
Fixup for 4bc5d27b94. Without this, we would hang
in daemon-reexec after upgrade.
Since busname units are only useful with kdbus, they weren't actively
used. This was dead code, only compile-tested. If busname units are
ever added back, it'll be cleaner to start from scratch (possibly reverting
parts of this patch).
This reverts commit 2d058a87ff.
When we add another name to a unit (by following an alias), we need to
reload all drop-ins. This is necessary to load any additional dropins
found in the dirs created from the alias name.
Fixes#6334.
When running systemd-analyze verify I would get a random subset of warnings
(sometimes none, sometimes one or two):
dev-mapper-luks\x2d8db85dcf\x2d6230\x2d4e88\x2d940d\x2dba176d062b31.swap: Unit is bound to inactive unit dev-mapper-luks\x2d8db85dcf\x2d6230\x2d4e88\x2d940d\x2dba176d062b31.device. Stopping, too.
home.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device. Stopping, too.
boot.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-56c56bfd\x2d93f0\x2d48fb\x2dbc4b\x2d90aa67144ea5.device. Stopping, too.
When running with debug on, it's pretty obvious what is happening:
home.mount: Changed dead -> mounted
home.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device. Stopping, too.
home.mount: Trying to enqueue job home.mount/stop/fail
home.mount: Installed new job home.mount/stop as 27
home.mount: Enqueued job home.mount/stop as 27
...
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Installed new job dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device/start as 47
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Changed dead -> plugged
dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device: Job dev-disk-by\x2duuid-75751556\x2d6e31\x2d438b\x2d99c9\x2dd626330d9a1b.device/start finished, result=done
Fixes#2206, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=808151.
When a DBus name is released, NameOwnerChanged signal contains an empty string
as new_owner. Commit bbc2908 changed interpretation of the empty string to a
valid name, which is not consistent with values that are sent by dbus-daemon.
As a side effect, this masks symptoms of systemd-logind dbus disconnections
(#2925) by completely restarting it so it can freshly reconnect to dbus.
The state of a unit was not fully restored, especially the
"cgroup_realized_mask/cgroup_enabled_mask" fields were missing.
This could be seen with the following sequence:
$ systemctl show -p TasksCurrent sshd
TasksCurrent=1
$ systemctl daemon-reload
$ systemctl show -p TasksCurrent sshd
TasksCurrent=18446744073709551615
This was also visible with the "status" command: "Tasks: " row wasn't
showed in status of a service after a "daemon-reload" command.
Unit.JobTimeoutSec starts when a job is enqueued in a transaction. The
introduced distinct Unit.JobRunningTimeoutSec starts only when the job starts
running (e.g. it groups all Exec* commands of a service or spans waiting for a
device period.)
Unit.JobRunningTimeoutSec is intended to be used by default instead of
Unit.JobTimeoutSec for device units where such behavior causes less confusion
(consider a job for a _netdev mount device, with this change the timeout will
start ticking only after the network is ready).
log_struct takes multiple format strings, each one followed by arguments.
The _printf_ annotation is not sufficiently flexible to express this,
but we can still annotate the first format string, though not its
arguments (because their number is unknown).
With the annotation, the places which specified the message id or similar
as the first pattern cause a warning from -Wformat-nonliteral. This can
be trivially fixed by putting the MESSAGE= first.
This change will help find issues where a non-literal is erroneously used
as the pattern.
If the unit's SourcePath is below /proc then it's a unit genreated from
a kernel resource (such as a .mount or .swap unit). And those we watch
anyway, and hence should never be out-of-date.
Fixes: #5461
If a client pins a unit, then it makes sense to also implicitly make it
a subscriber. This is useful for clients that just want to watch one
specific unit: they can pin it and receive its messages.
We use our cgroup APIs in various contexts, including from our libraries
sd-login, sd-bus. As we don#t control those environments we can't rely
that the unified cgroup setup logic succeeds, and hence really shouldn't
assert on it.
This more or less reverts 415fc41cea.
cg_[all_]unified() test whether a specific controller or all controllers are on
the unified hierarchy. While what's being asked is a simple binary question,
the callers must assume that the functions may fail any time, which
unnecessarily complicates their usages. This complication is unnecessary.
Internally, the test result is cached anyway and there are only a few places
where the test actually needs to be performed.
This patch simplifies cg_[all_]unified().
* cg_[all_]unified() are updated to return bool. If the result can't be
decided, assertion failure is triggered. Error handlings from their callers
are dropped.
* cg_unified_flush() is updated to calculate the new result synchrnously and
return whether it succeeded or not. Places which need to flush the test
result are updated to test for failure. This ensures that all the following
cg_[all_]unified() tests succeed.
* Places which expected possible cg_[all_]unified() failures are updated to
call and test cg_unified_flush() before calling cg_[all_]unified(). This
includes functions used while setting up mounts during boot and
manager_setup_cgroup().
Embedding sd_id128_t's in constant strings was rather cumbersome. We had
SD_ID128_CONST_STR which returned a const char[], but it had two problems:
- it wasn't possible to statically concatanate this array with a normal string
- gcc wasn't really able to optimize this, and generated code to perform the
"conversion" at runtime.
Because of this, even our own code in coredumpctl wasn't using
SD_ID128_CONST_STR.
Add a new macro to generate a constant string: SD_ID128_MAKE_STR.
It is not as elegant as SD_ID128_CONST_STR, because it requires a repetition
of the numbers, but in practice it is more convenient to use, and allows gcc
to generate smarter code:
$ size .libs/systemd{,-logind,-journald}{.old,}
text data bss dec hex filename
1265204 149564 4808 1419576 15a938 .libs/systemd.old
1260268 149564 4808 1414640 1595f0 .libs/systemd
246805 13852 209 260866 3fb02 .libs/systemd-logind.old
240973 13852 209 255034 3e43a .libs/systemd-logind
146839 4984 34 151857 25131 .libs/systemd-journald.old
146391 4984 34 151409 24f71 .libs/systemd-journald
It is also much easier to check if a certain binary uses a certain MESSAGE_ID:
$ strings .libs/systemd.old|grep MESSAGE_ID
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
MESSAGE_ID=%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x
$ strings .libs/systemd|grep MESSAGE_ID
MESSAGE_ID=c7a787079b354eaaa9e77b371893cd27
MESSAGE_ID=b07a249cd024414a82dd00cd181378ff
MESSAGE_ID=641257651c1b4ec9a8624d7a40a9e1e7
MESSAGE_ID=de5b426a63be47a7b6ac3eaac82e2f6f
MESSAGE_ID=d34d037fff1847e6ae669a370e694725
MESSAGE_ID=7d4958e842da4a758f6c1cdc7b36dcc5
MESSAGE_ID=1dee0369c7fc4736b7099b38ecb46ee7
MESSAGE_ID=39f53479d3a045ac8e11786248231fbf
MESSAGE_ID=be02cf6855d2428ba40df7e9d022f03d
MESSAGE_ID=7b05ebc668384222baa8881179cfda54
MESSAGE_ID=9d1aaa27d60140bd96365438aad20286
Let's make sure we verify that all BindsTo= are in order before we actually go
and dispatch a start operation to a unit. Normally the job queue should already
have made sure all deps are in order, but this might not have been sufficient
in two cases: a) when the user changes deps during runtime and reloads the
daemon, and b) when the user placed BindsTo= dependencies without matching
After= dependencies, so that we don't actually wait for the bound to unit to be
up before upping also the binding unit.
See: #4725
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.
This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
When an alias is loaded, we resolve this alias to its final unit first to load
the dropin data.
However if the final unit was already loaded, there's no point in reloading the
dropin data a second time.
This patch optimizes this case.
Also this allows the dropin loading code to assume that only units not yet
loaded are passed down. This assumption is not yet used but might be in the
future.
[zj: invert the condition in the if]
Since commit 9d06297, mount units from mountinfo are not bound to their devices
anymore (they use the "Requires" dependency instead).
This has the following drawback: if a media is mounted and the eject button is
pressed then the media is unconditionally ejected leaving some inconsistent
states.
Since udev is the component that is reacting (no matter if the device is used
or not) to the eject button, users expect that udev at least try to unmount the
media properly.
This patch introduces a new property "SYSTEMD_MOUNT_DEVICE_BOUND". When set on
a block device, all units that requires this device will see their "Requires"
dependency upgraded to a "BindTo" one. This is currently only used by cdrom
devices.
This patch also gives the possibility to the user to restore the previous
behavior that is bind a mount unit to a device. This is achieved by passing the
"x-systemd.device-bound" option to mount(8). Please note that currently this is
not working because libmount treats the x-* options has comments therefore
they're not available in utab for later application retrievals.
We generally try to make our destructors robust regarding NULL pointers, much
in the same way as glibc's free(). Do this also for unit_free().
Follow-up for #4748.
For some reasons units remaining in the same process group as PID 1
(same_pgrp=true) fail to acquire the console even if it's not taken by anyone.
So always accept for units with same_pgrp set for now.
In contrast to all other unit types device units when queued just track
external state, they cannot effect state changes on their own. Hence unless a
client or other job waits for them there's no reason to keep them in the job
queue. This adds a concept of GC'ing jobs of this type as soon as no client or
other job waits for them anymore.
To ensure this works correctly we need to track which clients actually
reference a job (i.e. which ones enqueued it). Unfortunately that's pretty
nasty to do for direct connections, as sd_bus_track doesn't work for
them. For now, work around this, by simply remembering in a boolean that a job
was requested by a direct connection, and reset it when we notice the direct
connection is gone. This means the GC logic works fine, except that jobs are
not immediately removed when direct connections disconnect.
In the longer term, a rework of the bus logic should fix this properly. For now
this should be good enough, as GC works for fine all cases except this one, and
thus is a clear improvement over the previous behaviour.
Fixes: #1921
The no_new_privileged_set variable is not used any more since commit
9b232d3241 that fixed another thing. So remove it. Also no
need to check if we are under user manager, remove that part too.
We don't have plural in the name of any other -util files and this
inconsistency trips me up every time I try to type this file name
from memory. "formats-util" is even hard to pronounce.
This reverts some changes introduced in d054f0a4d4.
xsprintf should be used in cases where we calculated the right buffer
size by hand (using DECIMAL_STRING_MAX and such), and never in cases where
we are printing externally specified strings of arbitrary length.
Fixes#4534.
So far "no_gc" was set on -.slice and init.scope, to units that are always
running, cannot be stopped and never exist in an "inactive" state. Since these
units are the only users of this flag, let's remodel it and rename it
"perpetual" and let's derive more funcitonality off it. Specifically, refuse
enqueing stop jobs for these units, and report that they are "unstoppable" in
the CanStop bus property.
We would close all the stored fds in service_release_resources(), which of
course broke the whole concept of storing fds over service restart.
Fixes#4408.
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.
This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
It is allowed for unit files to have an mtime==0, so instead of assuming that
any file that had mtime==0 was masked, use the load_state to filter masked
units.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1384150.
This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.
This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.
The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.
The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.
The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.
Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.
This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().
PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.
A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.
Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.
This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
It is useful for clients to be able to read the last CPU usage counter value of
a unit even if the unit is already terminated. Hence, before destroying a
cgroup's cgroup cache the last CPU usage counter and return it if the cgroup is
gone.
This adds two (privileged) bus calls Ref() and Unref() to the Unit interface.
The two calls may be used by clients to pin a unit into memory, so that various
runtime properties aren't flushed out by the automatic GC. This is necessary
to permit clients to race-freely acquire runtime results (such as process exit
status/code or accumulated CPU time) on successful service termination.
Ref() and Unref() are fully recursive, hence act like the usual reference
counting concept in C. Taking a reference is a privileged operation, as this
allows pinning units into memory which consumes resources.
Transient units may also gain a reference at the time of creation, via the new
AddRef property (that is only defined for transient units at the time of
creation).
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e. all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.
This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.
In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management. Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.
Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies. It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logics and would allow
using features which depend on the unified hierarchy membership such bpf cgroup
v2 membership test. In time, this would also allow deleting the said
complexities.
This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.
* cg_unified(@controller) is introduced which tests whether the specific
controller in on unified hierarchy and used to choose the unified hierarchy
code path for process and service management when available. Kernel
controller specific operations remain gated by cg_all_unified().
* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
force the use of legacy hierarchy for systemd cgroup controller.
* nspawn: By default nspawn uses the same hierarchies as the host. If
UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all. If
0, legacy for all.
* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
three options - legacy, only systemd controller on unified, and unified. The
value is passed into mount setup functions and controls cgroup configuration.
* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
appropriate action depending on the configuration of the host.
v2: - CGroupUnified enum replaces open coded integer values to indicate the
cgroup operation mode.
- Various style updates.
v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.
v4: Restored legacy container on unified host support and fixed another bug in
detect_unified_cgroup_hierarchy().
A following patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies. This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is. In preparation, this patch renames cg_unified() to
cg_all_unified().
This patch doesn't cause any functional changes.
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches. The situation is explained in
the following documentation.
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu
While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads. This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.
On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us". [Startup]CPUWeight config
options are added with the usual compat translation. CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.
v2: - Error in man page corrected.
- CPU config application in cgroup_context_apply() refactored.
- CPU accounting now works on unified hierarchy.
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).
For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.
A simple way to test this is:
systemd-run -p DynamicUser=1 /bin/sleep 99999
Let's lot at LOG_NOTICE about any processes that we are going to
SIGKILL/SIGABRT because clean termination of them didn't work.
This turns the various boolean flag parameters to cg_kill(), cg_migrate() and
related calls into a single binary flags parameter, simply because the function
now gained even more parameters and the parameter listed shouldn't get too
long.
Logging for killing processes is done either when the kill signal is SIGABRT or
SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE
is passed. This isn't used yet in this patch, but is made use of in a later
patch.
The unit load queue can be processed in the middle of setting the
unit's properties, so its load_state would no longer be UNIT_STUB
for the check in bus_unit_set_properties(), which would cause it to
incorrectly return an error.
By default, each iteration of manager_dispatch_sigchld() results in a unit level
sigchld event being invoked. For scope units, this results in a scope_sigchld_event()
which can seemingly stall for workloads that have a large number of PIDs within the
scope. The stall exhibits itself as a SIG_0 being initiated for each u->pids entry
as a result of pid_is_unwaited().
v2:
This patch resolves this condition by only paying to cost of a sigchld in the underlying
scope unit once per sigchld iteration. A new "sigchldgen" member resides within the
Unit struct. The Manager is incremented via the sd event loop, accessed via
sd_event_get_iteration, and the Unit member is set to the same value as the manager each
time that a sigchld event is invoked. If the Manager iteration value and Unit member
match, the sigchld event is not invoked for that iteration.
unit_write_drop_in{,_private}{,_format} are all affected.
We already append a header to the file (and section markers), so those functions
can only be used to write a whole file at once. Including the newline at
the end feels natural.
After this commit newlines will be duplicated. They will be removed in
subsequent commit.
Also, rewrap the "autogenerated" header to fit within 80 columns.
We currently generate log message about unit being started even when
unit was started already and job didn't do anything. This is because job
was requested explicitly and hence became anchor job of the transaction
thus we could not eliminate it. That is fine but, let's not pollute
journal with useless log messages.
$ systemctl start systemd-resolved
$ systemctl start systemd-resolved
$ systemctl start systemd-resolved
Current state:
$ journalctl -u systemd-resolved | grep Started
May 05 15:31:42 rawhide systemd[1]: Started Network Name Resolution.
May 05 15:31:59 rawhide systemd[1]: Started Network Name Resolution.
May 05 15:32:01 rawhide systemd[1]: Started Network Name Resolution.
After patch applied:
$ journalctl -u systemd-resolved | grep Started
May 05 16:42:12 rawhide systemd[1]: Started Network Name Resolution.
Fixes#1723