Commit Graph

96 Commits

Author SHA1 Message Date
Zbigniew Jędrzejewski-Szmek 2fe21124a6 Add open_memstream_unlocked() wrapper 2019-04-12 11:44:57 +02:00
Lennart Poettering eefc66aa8f util: split out some stuff into a new file limits-util.[ch] 2019-03-13 12:16:43 +01:00
Filipe Brandenburger 527ede0c63 core: downgrade CPUQuotaPeriodSec= clamping logs to debug
After the first warning log, further messages are downgraded to LOG_DEBUG.
2019-02-14 11:04:42 -08:00
Filipe Brandenburger 10f2864111 core: add CPUQuotaPeriodSec=
This new setting allows configuration of CFS period on the CPU cgroup, instead
of using a hardcoded default of 100ms.

Tested:
- Legacy cgroup + Unified cgroup
- systemctl set-property
- systemctl show
- Confirmed that the cgroup settings (such as cpu.cfs_period_ns) were set
  appropriately, including updating the CPU quota (cpu.cfs_quota_ns) when
  CPUQuotaPeriodSec= is updated.
- Checked that clamping works properly when either period or (quota * period)
  are below the resolution of 1ms, or if period is above the max of 1s.
2019-02-14 11:04:42 -08:00
Chris Down c72703e26d cgroup: Add DisableControllers= directive to disable controller in subtree
Some controllers (like the CPU controller) have a performance cost that
is non-trivial on certain workloads. While this can be mitigated and
improved to an extent, there will for some controllers always be some
overheads associated with the benefits gained from the controller.
Inside Facebook, the fix applied has been to disable the CPU controller
forcibly with `cgroup_disable=cpu` on the kernel command line.

This presents a problem: to disable or reenable the controller, a reboot
is required, but this is quite cumbersome and slow to do for many
thousands of machines, especially machines where disabling/enabling a
stateful service on a machine is a matter of several minutes.

Currently systemd provides some configuration knobs for these in the
form of `[Default]CPUAccounting`, `[Default]MemoryAccounting`, and the
like. The limitation of these is that Default*Accounting is overrideable
by individual services, of which any one could decide to reenable a
controller within the hierarchy at any point just by using a controller
feature implicitly (eg. `CPUWeight`), even if the use of that CPU
feature could just be opportunistic. Since many services are provided by
the distribution, or by upstream teams at a particular organisation,
it's not a sustainable solution to simply try to find and remove
offending directives from these units.

This commit presents a more direct solution -- a DisableControllers=
directive that forcibly disallows a controller from being enabled within
a subtree.
2018-12-03 15:40:31 +00:00
Chris Down f98c25850f cgroup v2: Don't require CPU controller for CPU accounting in 4.15+
systemd only uses functions that are as of Linux 4.15+ provided
externally to the CPU controller (currently usage_usec), so if we have a
new enough kernel, we don't need to set CGROUP_MASK_CPU for
CPUAccounting=true as the CPU controller does not need to necessarily be
enabled in this case.

Part of this patch is modelled on an earlier patch by Ryutaroh Matsumoto
(see PR #9665).
2018-11-18 12:21:41 +00:00
Tejun Heo 6ae4283cb1 core: add IODeviceLatencyTargetSec
This adds support for the following proposed latency based IO control
mechanism.

  https://lkml.org/lkml/2018/6/5/428
2018-08-22 16:46:18 +02:00
Tejun Heo 4842263577 core: add MemoryMin
The kernel added support for a new cgroup memory controller knob memory.min in
bf8d5d52ffe8 ("memcg: introduce memory.min") which was merged during v4.18
merge window.

Add MemoryMin to support memory.min.
2018-07-12 08:21:43 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering 57e84e7535 core: rework how we validate DeviceAllow= settings
Let's make sure we don't validate "char-*" and "block-*" expressions as
paths.
2018-06-11 18:01:06 +02:00
Lennart Poettering 9d5e9b4add cgroup: relax checks for block device cgroup settings
This drops needless safety checks that ensure we only reference block
devices for blockio/io settings. The backing code was already able to
accept regular file system paths too, in which case the backing device
node of that file system would be used. Hence, let's drop the artificial
restrictions and open up this underlying functionality.
2018-06-11 18:01:06 +02:00
Giuseppe Scrivano ef42f561fc src/core/dbus-cgroup.c: fix typo contoller -> controller (#8717)
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-04-14 11:06:11 +02:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Yu Watanabe 906bdbf5e7 core/cgroup: accepts MemorySwapMax=0 (#8366)
Also, this moves two macros from dbus-util.h to dbus-cgroup.c,
as they are only used in dbus-cgroup.c.

Fixes #8363.
2018-03-09 11:34:50 +01:00
Lennart Poettering 2ae7ee58fa bpf: beef up bpf detection, check if BPF_F_ALLOW_MULTI is supported
This improves the BPF/cgroup detection logic, and looks whether
BPF_ALLOW_MULTI is supported. This flag allows execution of multiple
BPF filters in a recursive fashion for a whole cgroup tree. It enables
us to properly report IP accounting for slice units, as well as
delegation of BPF support to units without breaking our own IP
accounting.
2018-02-21 16:43:36 +01:00
Lennart Poettering 1d9cc8768f cgroup: add a new "can_delegate" flag to the unit vtable, and set it for scope and service units only
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.

Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.

Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
2018-02-12 11:34:00 +01:00
Lennart Poettering 60644c3dea cgroup: fix handling of TasksAccounting= property 2018-01-23 21:22:50 +01:00
Yu Watanabe 681ae88e06 dbus-cgroup: simplify bus_cgroup_set_property() 2018-01-03 02:33:16 +09:00
Yu Watanabe fffbc1dc7f dbus-cgroup: add missing space 2018-01-03 02:33:06 +09:00
Yu Watanabe 32048f5414 cgroup: IODeviceWeight= or friends can take device node files in /run/systemd/inaccessible/
systemd creates several device nodes in /run/systemd/inaccessible/.
This makes CGroup's settings related to IO can take device node
files in the directory.
2017-12-23 19:32:42 +09:00
Yu Watanabe 13ec20d42a dbus-cgroup: merge several blocks which operate almost same tasks 2017-12-23 19:32:36 +09:00
Yu Watanabe d9f7305fd7 cgroup: move path checking logic to dbus-cgroup.c 2017-12-23 19:32:29 +09:00
Lennart Poettering 0d53667334 tree-wide: use __fsetlocking() instead of fxyz_unlocked()
Let's replace usage of fputc_unlocked() and friends by __fsetlocking(f,
FSETLOCKING_BYCALLER). This turns off locking for the entire FILE*,
instead of doing individual per-call decision whether to use normal
calls or _unlocked() calls.

This has various benefits:

1. It's easier to read and easier not to forget

2. It's more comprehensive, as fprintf() and friends are covered too
   (as these functions have no _unlocked() counterpart)

3. Philosophically, it's a bit more correct, because it's more a
   property of the file handle really whether we ever pass it on to another
   thread, not of the operations we then apply to it.

This patch reworks all pieces of codes that so far used fxyz_unlocked()
calls to use __fsetlocking() instead. It also reworks all places that
use open_memstream(), i.e. use stdio FILE* for string manipulations.

Note that this in some way a revert of 4b61c87511.
2017-12-14 10:42:25 +01:00
Lennart Poettering 66a892ae3d core: accept MemorySwapMax= properties that are scaled, too
Let's do what we already do for MemoryMax= and friends for
MemorySwapMax= too.
2017-11-29 20:12:26 +01:00
Lennart Poettering e74f76ca86 tree-wide: generate SD_BUS_ERROR_INVALID_ARGS when we get invalid arguments on bus calls
Let's make sure that when we return a D-Bus error, we return a native
one, if we generate it ourselves, and use errno-based error
synthetization only if we received an errno ourselves. Yes, this makes
things slightly longer, but is highly misleading as we propagate D-Bus
errors, and not errnos to the client.
2017-11-29 12:34:12 +01:00
Lennart Poettering 2e59b241ca core: add proper escaping to writing of drop-ins/transient unit files
This majorly refactors the transient unit file and drop-in writing
logic, so that we properly C-escape and specifier-escape (% → %%)
everything we write out, so that when we read it back again, specifiers
are parsed that aren't supposed to be parsed.

This renames unit_write_drop_in() and friends by unit_write_setting().
The name change is supposed to clarify that the functions are not only
used to write drop-in files, but also transient unit files.

The previous "mode" parameter to this function is replaced by a more
generic "flags", which knows additional flags for implicit C-style and
specifier escaping before writing things out. This can cover most
properties where either form of escaping is defined. For the cases where
this isn't sufficient, we add helpers unit_escape_setting() and
unit_concat_strv() for escaping individual strings or strvs properly.

While we are at it, we also prettify generation of transient unit files:
we try to reduce the number of section headers written out: previously
we'd write the right section header our for each setting. With this
change we do so only if the setting lives in a different section than
the one before.

(This should also be considered preparation for when we add proper APIs
to systemd to write normal, persistant unit files through the bus API)
2017-11-29 12:34:12 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Yu Watanabe 1bdfc7b951 core/cgroup: assigning empty string to Delegate= resets list of controllers (#7336)
Before this, assigning empty string to Delegate= makes no change to the
controller list. This is inconsistent to the other options that take list
of strings. After this, when empty string is assigned to Delegate=, the
list of controllers is reset. Such behavior is consistent to other options
and useful for drop-in configs.

Closes #7334.
2017-11-17 10:04:25 +01:00
Lennart Poettering ab8519c28b core: only warn about BPF/cgroup missing once per runtime (#7319)
Let's reduce the amount of noise a bit, there's little point in
complaining loudly about every single unit like this, let's complain
only about the first one, and then downgrade the log level to LOG_DEBUG
for the other cases.

Fixes: #7188
2017-11-13 22:02:51 +01:00
Lennart Poettering 0263828039 core: rework the Delegate= unit file setting to take a list of controller names
Previously it was not possible to select which controllers to enable for
a unit where Delegate=yes was set, as all controllers were enabled. With
this change, this is made configurable, and thus delegation units can
pick specifically what they want to manage themselves, and what they
don't care about.
2017-11-13 10:49:15 +01:00
Lennart Poettering 4fe66c8681 core: improve dbus-cgroup error message
As suggested by @keszybz in the review of #6764
2017-09-26 23:49:40 +02:00
Lennart Poettering 078ba556da core: warn loudly if IP firewalling is configured but not in effect 2017-09-22 15:24:55 +02:00
Lennart Poettering 1274b6c687 ip-address-access: minimize IP address lists
Let's drop redundant items from the IP address list after parsing. Let's
also mask out redundant bits hidden by the prefixlength.
2017-09-22 15:24:55 +02:00
Lennart Poettering 3dc5ca9787 core: support IP firewalling to be configured for transient units 2017-09-22 15:24:55 +02:00
Lennart Poettering 27458ed629 tree-wide: use path_startswith() rather than startswith() where ever that's appropriate
When checking path prefixes we really should use the right APIs, just in
case people add multiple slashes to their paths...
2017-08-09 19:03:39 +02:00
Lennart Poettering 4b61c87511 tree-wide: fput[cs]() → fput[cs]_unlocked() wherever that makes sense (#6396)
As a follow-up for db3f45e2d2 let's do the
same for all other cases where we create a FILE* with local scope and
know that no other threads hence can have access to it.

For most cases this shouldn't change much really, but this should speed
dbus introspection and calender time formatting up a bit.
2017-07-21 10:35:45 +02:00
Zbigniew Jędrzejewski-Szmek d4bf82fcac pid1: properly encode infinity when writing CPUQuota snippet (#6141)
We would write
  [Slice]
  CPUQuota=1844674407370955%
which is (numerically) correct, but it seems better to just write
  [Slice]
  CPUQuota=
which is interpreted as USEC_INFINITY by the parser in config_parse_cpu_quota().

Fixes #5965.
2017-06-18 11:18:41 +02:00
WaLyong Cho 96e131ea09 core: introduce MemorySwapMax=
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
2016-08-30 11:11:45 +09:00
Tejun Heo 66ebf6c0a1 core: add cgroup CPU controller support on the unified hierarchy
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches.  The situation is explained in
the following documentation.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu

While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads.  This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.

On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us".  [Startup]CPUWeight config
options are added with the usual compat translation.  CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.

v2: - Error in man page corrected.
    - CPU config application in cgroup_context_apply() refactored.
    - CPU accounting now works on unified hierarchy.
2016-08-07 09:45:39 -04:00
Lennart Poettering f7903e8db6 core: rename MemoryLimitByPhysicalMemory transient property to MemoryLimitScale
That way, we can neatly keep this in line with the new TasksMaxScale= option.

Note that we didn't release a version with MemoryLimitByPhysicalMemory= yet,
hence this change should be unproblematic without breaking API.
2016-07-22 15:33:12 +02:00
Lennart Poettering 83f8e80857 core: support percentage specifications on TasksMax=
This adds support for a TasksMax=40% syntax for specifying values relative to
the system's configured maximum number of processes. This is useful in order to
neatly subdivide the available room for tasks within containers.
2016-07-22 15:33:12 +02:00
Alessandro Puccetti 31d28eabc1 nspawn: enable major=0/minor=0 devices inside the container (#3773)
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
2016-07-21 17:39:38 +02:00
Lennart Poettering d58d600efd systemctl: allow percent-based MemoryLimit= settings via systemctl set-property
The unit files already accept relative, percent-based memory limit
specification, let's make sure "systemctl set-property" support this too.

Since we want the physical memory size of the destination machine to apply we
pass the percentage in a new set of properties that only exist for this
purpose, and can only be set.
2016-06-14 20:01:45 +02:00
Lennart Poettering 799ec13412 core: make sure to use "infinity" in unit files, not "max"
THe latter is a kernelism, we only understand "infinity".
2016-06-14 19:50:38 +02:00
Lennart Poettering cd0a7a8e58 core: when receiving a memory limit via the bus, refuse 0
When parsing unit files we already refuse unit memory limits of zero, let's
also refuse it when the value is set via the bus.
2016-06-14 19:50:38 +02:00
Zbigniew Jędrzejewski-Szmek b27b4b51c6 tree-wide: remove newlines from unit_write_drop_in
This reverts part of #3329, but all for a good cause.
2016-05-28 16:29:42 -04:00
Tejun Heo da4d897e75 core: add cgroup memory controller support on the unified hierarchy (#3315)
On the unified hierarchy, memory controller implements three control knobs -
low, high and max which enables more useable and versatile control over memory
usage.  This patch implements support for the three control knobs.

* MemoryLow, MemoryHigh and MemoryMax are added for memory.low, memory.high and
  memory.max, respectively.

* As all absolute limits on the unified hierarchy use "max" for no limit, make
  memory limit parse functions accept "max" in addition to "infinity" and
  document "max" for the new knobs.

* Implement compatibility translation between MemoryMax and MemoryLimit.

v2:

- Fixed missing else's in config_parse_memory_limit().
- Fixed missing newline when writing out drop-ins.
- Coding style updates to use "val > 0" instead of "val".
- Minor updates to documentation.
2016-05-27 18:10:18 +02:00
Tejun Heo 0c2d96f5f5 core: fix missing newlines when writing out drop-ins for cgroup settings
Except for per-device BlockIO, IO and DeviceAllow/Deny settings, all were
missing newline causing the next drop-in to be concatenated at the end of the
line.  Fix it.
2016-05-23 16:48:46 -04:00
Tejun Heo 6fb0926976 core: fix the reversed sanity check when setting StartupBlockIOWeight over dbus
bus_cgroup_set_property() was rejecting if the input value was in range.
Reverse it.
2016-05-23 16:48:46 -04:00