Commit Graph

94 Commits

Author SHA1 Message Date
Anita Zhang fe8d22fb09 core: systemd-oomd pid1 integration 2020-10-07 17:12:24 -07:00
Anita Zhang 4d824a4e0b core: add ManagedOOM*= properties to configure systemd-oomd on the unit
This adds the hook ups so it can be read with the usual systemd
utilities. Used in later commits by sytemd-oomd.
2020-10-07 16:17:23 -07:00
Michal Koutný d9ef594454 cgroup: Cleanup function usage
Some masks shouldn't be needed externally, so keep their functions in
the module (others would fit there too but they're used in tests) to
think twice if something would depend on them.

Drop unused function cg_attach_many_everywhere.

Use cgroup_realized instead of cgroup_path when we actually ask for
realized.

This should not cause any functional changes.
2020-08-19 11:41:53 +02:00
Michal Koutný 4c591f3996 cgroup: Introduce family queueing instead of siblings
The unit_add_siblings_to_cgroup_realize_queue does more than mere
siblings queueing, hence define a family of a unit as (immediate)
children of the unit and immediate children of all ancestors.

Working with this abstraction simplifies the queuing calls and it
shouldn't change the functionality.
2020-08-19 11:41:53 +02:00
Michal Koutný fb46fca7e0 cgroup: Eager realization in unit_free
unit_free(u) realizes direct parent and invalidates members mask of all
ancestors. This isn't sufficient in v1 controller hierarchies since
siblings of the freed unit may have existed only because of the removed
unit.

We cannot be lazy about the siblings because if parent(u) is also
removed, it'd migrate and rmdir cgroups for siblings(u). However,
realized masks of siblings(u) won't reflect this change.

This was a non-issue earlier, because we weren't removing cgroup
directories properly (effectively matching the stale realized mask),
removal failed because of tasks left by missing migration (see previous
commit).

Therefore, ensure realization of all units necessary to clean up after
the free'd unit.

Fixes: #14149
2020-08-19 11:41:53 +02:00
Michal Sekletár d9e45bc3ab core: introduce support for cgroup freezer
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.

This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.

Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
2020-04-30 19:02:51 +02:00
Zbigniew Jędrzejewski-Szmek 3a0f06c41a core: make TasksMax a partially dynamic property
TasksMax= and DefaultTasksMax= can be specified as percentages. We don't
actually document of what the percentage is relative to, but the implementation
uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max,
and /sys/fs/cgroup/pids.max (when present). When the value is a percentage,
we immediately convert it to an absolute value. If the limit later changes
(which can happen e.g. when systemd-sysctl runs), the absolute value becomes
outdated.

So let's store either the percentage or absolute value, whatever was specified,
and only convert to an absolute value when the value is used. For example, when
starting a unit, the absolute value will be calculated when the cgroup for
the unit is created.

Fixes #13419.
2019-11-14 18:41:54 +01:00
Zbigniew Jędrzejewski-Szmek 084870f9c0 core: rename CGROUP_AUTO/STRICT/CLOSED to CGROUP_DEVICE_POLICY_…
The old names were very generic, and when used without context it wasn't at all
clear that they are about the devices policy.
2019-11-10 23:22:15 +01:00
Chris Down bc0623df16 cgroup: analyze: Report memory configurations that deviate from systemd
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
2019-10-03 15:06:25 +01:00
Pavel Hrdina 047f5d63d7 cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller.  This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.

The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written.  However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well.  In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
2019-09-24 15:16:07 +02:00
Michael Olbrich da8e178296 job: make the run queue order deterministic
Jobs are added to the run queue in random order. This happens because most
jobs are added by iterating over the transaction or dependency hash maps.

As a result, jobs that can be executed at the same time are started in a
different order each time.
On small embedded devices this can cause a measurable jitter for the point
in time when a job starts (~100ms jitter for 10 units that are started in
random order).
This results is a similar jitter for the boot time. This is undesirable in
general and make optimizing the boot time a lot harder.
Also, jobs that should have a higher priority because the unit has a higher
CPU weight might get executed later than others.

Fix this by turning the job run_queue into a Prioq and sort by the
following criteria (use the next if the values are equal):
- CPU weight
- nice level
- unit type
- unit name

The last one is just there for deterministic sorting to avoid any jitter.
2019-07-18 10:28:39 +02:00
Kai Lüke fab347489f bpf-firewall: custom BPF programs through IP(Ingress|Egress)FilterPath=
Takes a single /sys/fs/bpf/pinned_prog string as argument, but may be
specified multiple times. An empty assignment resets all previous filters.

Closes https://github.com/systemd/systemd/issues/10227
2019-06-25 09:56:16 +02:00
Zbigniew Jędrzejewski-Szmek 2ba6ae6b2b core: do an extra check if oom was triggered when handling sigchild
Should fix #12425.
2019-05-20 16:37:06 +02:00
Ben Boeckel 5238e95759 codespell: fix spelling errors 2019-04-29 16:47:18 +02:00
Zbigniew Jędrzejewski-Szmek c5b7ae0edb
Merge pull request #12074 from poettering/io-acct
expose IO stats on the bus and in "systemctl status" and "systemd-run --wait"
2019-04-25 11:59:37 +02:00
Chris Down 7ad5439e06 unit: Add DefaultMemoryMin 2019-04-16 18:45:04 +01:00
Chris Down c52db42b78 cgroup: Implement default propagation of MemoryLow with DefaultMemoryLow
In cgroup v2 we have protection tunables -- currently MemoryLow and
MemoryMin (there will be more in future for other resources, too). The
design of these protection tunables requires not only intermediate
cgroups to propagate protections, but also the units at the leaf of that
resource's operation to accept it (by setting MemoryLow or MemoryMin).

This makes sense from an low-level API design perspective, but it's a
good idea to also have a higher-level abstraction that can, by default,
propagate these resources to children recursively. In this patch, this
happens by having descendants set memory.low to N if their ancestor has
DefaultMemoryLow=N -- assuming they don't set a separate MemoryLow
value.

Any affected unit can opt out of this propagation by manually setting
`MemoryLow` to some value in its unit configuration. A unit can also
stop further propagation by setting `DefaultMemoryLow=` with no
argument. This removes further propagation in the subtree, but has no
effect on the unit itself (for that, use `MemoryLow=0`).

Our use case in production is simplifying the configuration of machines
which heavily rely on memory protection tunables, but currently require
tweaking a huge number of unit files to make that a reality. This
directive makes that significantly less fragile, and decreases the risk
of misconfiguration.

After this patch is merged, I will implement DefaultMemoryMin= using the
same principles.
2019-04-12 17:23:58 +02:00
Lennart Poettering fbe14fc9a7 croup: expose IO accounting data per unit
This was the last kind of accounting still not exposed on for each unit.
Let's fix that.

Note that this is a relatively simplistic approach: we don't expose
per-device stats, but sum them all up, much like cgtop does. This kind
of metric is probably the most interesting for most usecases, and covers
the "systemctl status" output best. If we want per-device stats one day
we can of course always add that eventually.
2019-04-12 14:25:44 +02:00
Lennart Poettering 9b2559a13e core: add new call unit_reset_accounting()
It's a simple wrapper for resetting both IP and CPU accounting in one
go.

This will become particularly useful when we also needs this to reset IO
accounting (to be added in a later commit).
2019-04-12 14:25:44 +02:00
Lennart Poettering afcfaa695c core: implement OOMPolicy= and watch cgroups for OOM killings
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).

On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".

All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
2019-04-09 11:17:58 +02:00
Lennart Poettering 0a6991e0bb tree-wide: reorder various structures to make them smaller and use fewer cache lines
Some "pahole" spelunking.
2019-03-27 18:11:11 +01:00
Filipe Brandenburger 10f2864111 core: add CPUQuotaPeriodSec=
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.

Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
  appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
  CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
  are below the resolution of 1ms, or if period is above the max of 1s.
2019-02-14 11:04:42 -08:00
Zbigniew Jędrzejewski-Szmek 303ee60151 Mark *data and *userdata params to specifier_printf() as const
It would be very wrong if any of the specfier printf calls modified
any of the objects or data being printed. Let's mark all arguments as const
(primarily to make it easier for the reader to see where modifications cannot
occur).
2018-12-12 16:45:33 +01:00
Chris Down c72703e26d cgroup: Add DisableControllers= directive to disable controller in subtree
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
2018-12-03 15:40:31 +00:00
Lennart Poettering 5af8805872 cgroup: drastically simplify caching of cgroups members mask
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.

This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.

Fixes: #9512
2018-11-23 13:41:37 +01:00
Lennart Poettering 27adcc9737 cgroup: be more careful with which controllers we can enable/disable on a cgroup
This changes cg_enable_everywhere() to return which controllers are
enabled for the specified cgroup. This information is then used to
correctly track the enablement mask currently in effect for a unit.
Moreover, when we try to turn off a controller, and this works, then
this is indicates that the parent unit might succesfully turn it off
now, too as our unit might have kept it busy.

So far, when realizing cgroups, i.e. when syncing up the kernel
representation of relevant cgroups with our own idea we would strictly
work from the root to the leaves. This is generally a good approach, as
when controllers are enabled this has to happen in root-to-leaves order.
However, when controllers are disabled this has to happen in the
opposite order: in leaves-to-root order (this is because controllers can
only be enabled in a child if it is already enabled in the parent, and
if it shall be disabled in the parent then it has to be disabled in the
child first, otherwise it is considered busy when it is attempted to
remove it in the parent).

To make things complicated when invalidating a unit's cgroup membershup
systemd can actually turn off some controllers previously turned on at
the very same time as it turns on other controllers previously turned
off. In such a case we have to work up leaves-to-root *and*
root-to-leaves right after each other. With this patch this is
implemented: we still generally operate root-to-leaves, but as soon as
we noticed we successfully turned off a controller previously turned on
for a cgroup we'll re-enqueue the cgroup realization for all parents of
a unit, thus implementing leaves-to-root where necessary.
2018-11-23 13:41:37 +01:00
Lennart Poettering 1649244588 cgroup: make unit_get_needs_bpf_firewall() static too 2018-11-23 12:24:37 +01:00
Lennart Poettering 53aea74a60 cgroup: make some functions static 2018-11-23 12:24:37 +01:00
Lennart Poettering 611c4f8afb cgroup: rename {manager_owns|unit_has}_root_cgroup() → .._host_root_cgroup()
Let's emphasize that this function checks for the host root cgroup, i.e.
returns false for the root cgroup when we run in a container where
CLONE_NEWCGROUP is used. There has been some confusion around this
already, for example cgroup_context_apply() uses the function
incorrectly (which we'll fix in a later commit).

Just some refactoring, not change in behaviour.
2018-11-23 12:24:37 +01:00
Roman Gushchin 17f149556a core: refactor bpf firewall support into a pseudo-controller
The idea is to introduce a concept of bpf-based pseudo-controllers
to make adding new bpf-based features easier.
2018-10-09 09:46:08 -07:00
Tejun Heo 6ae4283cb1 core: add IODeviceLatencyTargetSec
This adds support for the following proposed latency based IO control
mechanism.

  https://lkml.org/lkml/2018/6/5/428
2018-08-22 16:46:18 +02:00
Yu Watanabe fd870bac25 core: introduce cgroup_add_device_allow() 2018-08-06 13:42:14 +09:00
Tejun Heo 4842263577 core: add MemoryMin
The kernel added support for a new cgroup memory controller knob memory.min in
bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18
merge window.

Add MemoryMin to support memory.min.
2018-07-12 08:21:43 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Felipe Sateler 90a8f0b9a9 core: Break circular dependency between unit.h and cgroup.h 2018-05-15 14:23:32 -04:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Lennart Poettering 902c8502ad
Merge pull request #8149 from poettering/fake-root-cgroup
Properly synthesize CPU+memory accounting data for the root cgroup
2018-03-01 11:10:24 +01:00
Lennart Poettering 6592b9759c core: add new new bus call for migrating foreign processes to scope/service units
This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.

The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.

The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.

The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have now User= field).

Fixes: #3388
2018-02-12 11:34:00 +01:00
Lennart Poettering 1d9cc8768f cgroup: add a new "can_delegate" flag to the unit vtable, and set it for scope and service units only
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.

Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.

Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
2018-02-12 11:34:00 +01:00
Lennart Poettering cc6271f17d core: turn on memory/cpu/tasks accounting by default for the root slice
The kernel exposes the necessary data in /proc anyway, let's expose it
hence by default.

With this in place "systemctl status -- -.slice" will show accounting
data out-of-the-box now.
2018-02-09 19:07:39 +01:00
Zbigniew Jędrzejewski-Szmek f26f5b60d0 Merge pull request #7915 from poettering/pids-max-tweak 2018-01-25 10:24:35 +01:00
Lennart Poettering 11aef522c1 core: unify call we use to synthesize cgroup empty events when we stopped watching any unit PIDs
This code is very similar in scope and service units, let's unify it in
one function. This changes little for service units, but for scope units
makes sure we go through the cgroup queue, which is something we should
do anyway.
2018-01-23 21:22:50 +01:00
Lennart Poettering f3725e64fe cgroup: add proper API to determine whether our unit manags to root cgroup 2018-01-22 16:26:55 +01:00
Lennart Poettering a4634b214c core: warn about left-over processes in cgroup on unit start
Now that we don't kill control processes anymore, let's at least warn
about any processes left-over in the unit cgroup at the moment of
starting the unit.
2017-11-25 17:08:21 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 0263828039 core: rework the Delegate= unit file setting to take a list of controller names
Previously it was not possible to select which controllers to enable for
a unit where Delegate=yes was set, as all controllers were enabled. With
this change, this is made configurable, and thus delegation units can
pick specifically what they want to manage themselves, and what they
don't care about.
2017-11-13 10:49:15 +01:00
Lennart Poettering 09e2465407 cgroup: after determining that a cgroup is empty, asynchronously dispatch this
This makes sure that if we learn via inotify or another event source
that a cgroup is empty, and we checked that this is indeed the case (as
we might get spurious notifications through inotify, as the inotify
logic through the "cgroups.event" is pretty unspecific and might be
trigger for a variety of reasons), then we'll enqueue a defer event for
it, at a priority lower than SIGCHLD handling, so that we know for sure
that if there's waitid() data for a process we used it before
considering the cgroup empty notification.

Fixes: #6608
2017-09-27 18:26:18 +02:00
Lennart Poettering 91a6073ef7 core: rename cgroup_queue → cgroup_realize_queue
We are about to add second cgroup-related queue, called
"cgroup_empty_queue", hence let's rename "cgroup_queue" to
"cgroup_realize_queue" (as that is its purpose) to minimize confusion
about the two queues.

Just a rename, no functional changes.
2017-09-27 17:59:25 +02:00
Zbigniew Jędrzejewski-Szmek 2e4025c0f9 core/cgroup: add a helper macro for a common pattern (#6926) 2017-09-27 17:54:06 +02:00