Commit Graph

291 Commits

Author SHA1 Message Date
Lennart Poettering add005357d core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
2016-11-04 07:40:13 -06:00
Martin Pitt 1740c5a807 Merge pull request #4458 from keszybz/man-nonewprivileges
Document NoNewPrivileges default value
2016-10-28 15:35:29 +02:00
Lennart Poettering 8130926d32 core: rework syscall filter set handling
A variety of fixes:

- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
  instance of it (the syscall_filter_sets[] array) used to abbreviate
  "SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
  mix and match too wildly. Let's pick the shorter name in this case, as it is
  sufficiently well established to not confuse hackers reading this.

- Export explicit indexes into the syscall_filter_sets[] array via an enum.
  This way, code that wants to make use of a specific filter set, can index it
  directly via the enum, instead of having to search for it. This makes
  apply_private_devices() in particular a lot simpler.

- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
  find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
  seccomp object.

- Update SystemCallFilter= parser to use extract_first_word().  Let's work on
  deprecating FOREACH_WORD_QUOTED().

- Simplify apply_private_devices() using this functionality
2016-10-24 17:32:50 +02:00
Zbigniew Jędrzejewski-Szmek 9b232d3241 core: do not set no_new_privileges flag in config_parse_syscall_filter
If SyscallFilter was set, and subsequently cleared, the no_new_privileges flag
was not reset properly. We don't need to set this flag here, it will be
set automatically in unit_patch_contexts() if syscall_filter is set.
2016-10-22 23:42:34 -04:00
Lukas Nykryn 87a47f99bc failure-action: generalize failure action to emergency action 2016-10-21 15:13:50 +02:00
Luca Bruno 52c239d770 core/exec: add a named-descriptor option ("fd") for streams (#4179)
This commit adds a `fd` option to `StandardInput=`,
`StandardOutput=` and `StandardError=` properties in order to
connect standard streams to externally named descriptors provided
by some socket units.

This option looks for a file descriptor named as the corresponding
stream. Custom names can be specified, separated by a colon.
If multiple name-matches exist, the first matching fd will be used.
2016-10-17 20:05:49 -04:00
Zbigniew Jędrzejewski-Szmek 3b319885c4 tree-wide: introduce free_and_replace helper
It's a common pattern, so add a helper for it. A macro is necessary
because a function that takes a pointer to a pointer would be type specific,
similarly to cleanup functions. Seems better to use a macro.
2016-10-16 23:35:39 -04:00
Zbigniew Jędrzejewski-Szmek 3ccb886283 Allow block and char classes in DeviceAllow bus properties (#4353)
Allowed paths are unified betwen the configuration file parses and the bus
property checker. The biggest change is that the bus code now allows "block-"
and "char-" classes. In addition, path_startswith("/dev") was used in the bus
code, and startswith("/dev") was used in the config file code. It seems
reasonable to use path_startswith() which allows a slightly broader class of
strings.

Fixes #3935.
2016-10-12 11:12:11 +02:00
Lennart Poettering cf08b48642 core: introduce MemorySwapMax= (#3659)
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
2016-08-31 12:28:54 +02:00
WaLyong Cho 96e131ea09 core: introduce MemorySwapMax=
Similar to MemoryMax=, MemorySwapMax= limits swap usage. This controls
controls "memory.swap.max" attribute in unified cgroup.
2016-08-30 11:11:45 +09:00
Douglas Christman 2507992f6b load-fragment: Resolve specifiers in OnCalendar and On*Sec
Resolves #3534
2016-08-26 12:13:16 -04:00
Tejun Heo f50582649f logind: update empty and "infinity" handling for [User]TasksMax (#3835)
The parsing functions for [User]TasksMax were inconsistent.  Empty string and
"infinity" were interpreted as no limit for TasksMax but not accepted for
UserTasksMax.  Update them so that they're consistent with other knobs.

* Empty string indicates the default value.
* "infinity" indicates no limit.

While at it, replace opencoded (uint64_t) -1 with CGROUP_LIMIT_MAX in TasksMax
handling.

v2: Update empty string to indicate the default value as suggested by Zbigniew
    Jędrzejewski-Szmek.

v3: Fixed empty UserTasksMax handling.
2016-08-18 22:57:53 -04:00
Tejun Heo 66ebf6c0a1 core: add cgroup CPU controller support on the unified hierarchy
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches.  The situation is explained in
the following documentation.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu

While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads.  This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.

On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us".  [Startup]CPUWeight config
options are added with the usual compat translation.  CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.

v2: - Error in man page corrected.
    - CPU config application in cgroup_context_apply() refactored.
    - CPU accounting now works on unified hierarchy.
2016-08-07 09:45:39 -04:00
Lennart Poettering 41bf0590cc util-lib: unify parsing of nice level values
This adds parse_nice() that parses a nice level and ensures it is in the right
range, via a new nice_is_valid() helper. It then ports over a number of users
to this.

No functional changes.
2016-08-05 11:18:32 +02:00
David Michael 5124866d73 util-lib: add parse_percent_unbounded() for percentages over 100% (#3886)
This permits CPUQuota to accept greater values as documented.
2016-08-04 13:09:54 +02:00
Zbigniew Jędrzejewski-Szmek dadd6ecfa5 Merge pull request #3728 from poettering/dynamic-users 2016-07-25 16:40:26 -04:00
Lennart Poettering 43eb109aa9 core: change ExecStart=! syntax to ExecStart=+ (#3797)
As suggested by @mbiebl we already use the "!" special char in unit file
assignments for negation, hence we should not use it in a different context for
privileged execution. Let's use "+" instead.
2016-07-25 16:53:33 +02:00
Lennart Poettering 66dccd8d85 core: be stricter when parsing User=/Group= fields
Let's verify the validity of the syntax of the user/group names set.
2016-07-22 15:53:45 +02:00
Lennart Poettering b3785cd5e6 core: check for overflow when handling scaled MemoryLimit= settings
Just in case...
2016-07-22 15:33:13 +02:00
Lennart Poettering 83f8e80857 core: support percentage specifications on TasksMax=
This adds support for a TasksMax=40% syntax for specifying values relative to
the system's configured maximum number of processes. This is useful in order to
neatly subdivide the available room for tasks within containers.
2016-07-22 15:33:12 +02:00
Luca Bruno 391b81cd03 seccomp: only abort on syscall name resolution failures (#3701)
seccomp_syscall_resolve_name() can return a mix of positive and negative
(pseudo-) syscall numbers, while errors are signaled via __NR_SCMP_ERROR.
This commit lets the syscall filter parser only abort on real parsing
failures, letting libseccomp handle pseudo-syscall number on its own
and allowing proper multiplexed syscalls filtering.
2016-07-12 11:55:26 +02:00
Torstein Husebø 61233823aa treewide: fix typos and remove accidental repetition of words 2016-07-11 16:18:43 +02:00
Lennart Poettering 616aab6085 Merge pull request #3481 from poettering/relative-memcg
various changes, most importantly regarding memory metrics
2016-06-16 13:56:23 +02:00
Zbigniew Jędrzejewski-Szmek a1feacf77f load-fragment: ignore ENOTDIR/EACCES errors (#3510)
If for whatever reason the file system is "corrupted", we want
to be resilient and ignore the error, as long as we can load the units
from a different place.

Arch bug https://bugs.archlinux.org/task/49547.

A user had an ntfs symlink (essentially a file) instead of a directory after
restoring from backup. We should just ignore that like we would treat a missing
directory, for general resiliency.

We should treat permission errors similarly. For example an unreadable
/usr/local/lib directory would prevent (user) instances of systemd from
loading any units. It seems better to continue.
2016-06-15 23:02:27 +02:00
Lennart Poettering d8cf2ac79b util: introduce physical_memory_scale() to unify how we scale by physical memory
The various bits of code did the scaling all different, let's unify this,
given that the code is not trivial.
2016-06-14 20:01:45 +02:00
Lennart Poettering 875ae5661a core: optionally, accept a percentage value for MemoryLimit= and related settings
If a percentage is used, it is taken relative to the installed RAM size. This
should make it easier to write generic unit files that adapt to the local system.
2016-06-14 19:50:38 +02:00
Lennart Poettering 9184ca48ea util-lib: introduce parse_percent() for parsing percent specifications
And port a couple of users over to it.
2016-06-14 19:50:38 +02:00
Alessandro Puccetti cf677fe686 core/execute: add the magic character '!' to allow privileged execution (#3493)
This patch implements the new magic character '!'. By putting '!' in front
of a command, systemd executes it with full privileges ignoring paramters
such as User, Group, SupplementaryGroups, CapabilityBoundingSet,
AmbientCapabilities, SecureBits, SystemCallFilter, SELinuxContext,
AppArmorProfile, SmackProcessLabel, and RestrictAddressFamilies.

Fixes partially https://github.com/systemd/systemd/issues/3414
Related to https://github.com/coreos/rkt/issues/2482

Testing:
1. Create a user 'bob'
2. Create the unit file /etc/systemd/system/exec-perm.service
   (You can use the example below)
3. sudo systemctl start ext-perm.service
4. Verify that the commands starting with '!' were not executed as bob,
   4.1 Looking to the output of ls -l /tmp/exec-perm
   4.2 Each file contains the result of the id command.

`````````````````````````````````````````````````````````````````
[Unit]
Description=ext-perm

[Service]
Type=oneshot
TimeoutStartSec=0
User=bob
ExecStartPre=!/usr/bin/sh -c "/usr/bin/rm /tmp/exec-perm*" ;
    /usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-pre"
ExecStart=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start" ;
    !/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-star-2"
ExecStartPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-start-post"
ExecReload=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-reload"
ExecStop=!/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop"
ExecStopPost=/usr/bin/sh -c "/usr/bin/id > /tmp/exec-perm-stop-post"

[Install]
WantedBy=multi-user.target]
`````````````````````````````````````````````````````````````````
2016-06-10 18:19:54 +02:00
Lennart Poettering 9d3e340639 load-fragment: don't try to do a template instance replacement if we are not an instance (#3451)
Corrects: 7aad67e7

Fixes: #3438
2016-06-09 10:49:36 +02:00
Tejun Heo e57c9ce169 core: always use "infinity" for no upper limit instead of "max" (#3417)
Recently added cgroup unified hierarchy support uses "max" in configurations
for no upper limit.  While consistent with what the kernel uses for no upper
limit, it is inconsistent with what systemd uses for other controllers such as
memory or pids.  There's no point in introducing another term.  Update cgroup
unified hierarchy support so that "infinity" is the only term that systemd
uses for no upper limit.
2016-06-03 17:49:05 +02:00
Topi Miettinen 201c1cc22a core: add pre-defined syscall groups to SystemCallFilter= (#3053) (#3157)
Implement sets of system calls to help constructing system call
filters. A set starts with '@' to distinguish from a system call.

Closes: #3053, #3157
2016-06-01 11:56:01 +02:00
Tejun Heo da4d897e75 core: add cgroup memory controller support on the unified hierarchy (#3315)
On the unified hierarchy, memory controller implements three control knobs -
low, high and max which enables more useable and versatile control over memory
usage.  This patch implements support for the three control knobs.

* MemoryLow, MemoryHigh and MemoryMax are added for memory.low, memory.high and
  memory.max, respectively.

* As all absolute limits on the unified hierarchy use "max" for no limit, make
  memory limit parse functions accept "max" in addition to "infinity" and
  document "max" for the new knobs.

* Implement compatibility translation between MemoryMax and MemoryLimit.

v2:

- Fixed missing else's in config_parse_memory_limit().
- Fixed missing newline when writing out drop-ins.
- Coding style updates to use "val > 0" instead of "val".
- Minor updates to documentation.
2016-05-27 18:10:18 +02:00
Tejun Heo 979d03117f core: update CGroupBlockIODeviceBandwidth to record both rbps and wbps
CGroupBlockIODeviceBandwith is used to keep track of IO bandwidth limits for
legacy cgroup hierarchies.  Unlike the unified hierarchy counterpart
CGroupIODeviceLimit, a CGroupBlockIODeviceBandwiddth records either a read or
write limit and has a couple issues.

* There's no way to clear specific config entry.

* When configs are cleared for an IO direction of a unit, the kernel settings
  aren't cleared accordingly creating discrepancies.

This patch updates CGroupBlockIODeviceBandwidth so that it behaves similarly to
CGroupIODeviceLimit - each entry records both rbps and wbps limits and is
cleared if both are at default values after kernel settings are updated.
2016-05-18 13:51:46 -07:00
Tejun Heo 9be572497d core: introduce CGroupIOLimitType enums
Currently, there are two cgroup IO limits, bandwidth max for read and write,
and they are hard-coded in various places.  This is fine for two limits but IO
is expected to grow more limits - low, high and max limits for bandwidth and
IOPS - and hard-coding each limit won't make sense.

This patch replaces hard-coded limits with an array indexed by
CGroupIOLimitType and accompanying string and default value tables so that new
limits can be added trivially.
2016-05-18 13:50:56 -07:00
Lennart Poettering 3103459e90 Merge pull request #3193 from htejun/cgroup-io-controller
core: add io controller support on the unified hierarchy
2016-05-16 22:05:27 +02:00
Lennart Poettering d31645adef tree-wide: port more code to use ifname_valid() 2016-05-09 15:45:31 +02:00
Tejun Heo 13c31542cc core: add io controller support on the unified hierarchy
On the unified hierarchy, blkio controller is renamed to io and the interface
is changed significantly.

* blkio.weight and blkio.weight_device are consolidated into io.weight which
  uses the standardized weight range [1, 10000] with 100 as the default value.

* blkio.throttle.{read|write}_{bps|iops}_device are consolidated into io.max.
  Expansion of throttling features is being worked on to support
  work-conserving absolute limits (io.low and io.high).

* All stats are consolidated into io.stats.

This patchset adds support for the new interface.  As the interface has been
revamped and new features are expected to be added, it seems best to treat it
as a separate controller rather than trying to expand the blkio settings
although we might add automatic translation if only blkio settings are
specified.

* io.weight handling is mostly identical to blkio.weight[_device] handling
  except that the weight range is different.

* Both read and write bandwidth settings are consolidated into
  CGroupIODeviceLimit which describes all limits applicable to the device.
  This makes it less painful to add new limits.

* "max" can be used to specify the maximum limit which is equivalent to no
  config for max limits and treated as such.  If a given CGroupIODeviceLimit
  doesn't contain any non-default configs, the config struct is discarded once
  the no limit config is applied to cgroup.

* lookup_blkio_device() is renamed to lookup_block_device().

Signed-off-by: Tejun Heo <htejun@fb.com>
2016-05-05 16:43:06 -04:00
Zbigniew Jędrzejewski-Szmek a82394c889 Merge pull request #2921 from keszybz/do-not-report-masked-units-as-changed 2016-05-03 14:08:39 -04:00
Zbigniew Jędrzejewski-Szmek d43bbb52de Revert "Do not report masked units as changed (#2921)"
This reverts commit 6d10d308c6.

It got squashed by mistake.
2016-05-03 14:08:23 -04:00
Zbigniew Jędrzejewski-Szmek 8a993b61d1 Move no_alias information to shared/
This way it can be used in install.c in subsequent commit.
2016-05-01 19:40:51 -04:00
Lennart Poettering a837f08803 core: when encountering a symlink alias for non-aliasable units warn nicely
If the user defines a symlink alias for a unit whose type does not support
aliasing, detect this early and print a nice warning.

Fixe: #2730
2016-04-29 18:06:12 +02:00
Martin Pitt 025ef1d226 Merge pull request #2973 from poettering/search-path
Many fixes, in particular to the install logic
2016-04-12 18:20:13 +02:00
Nicolas Braud-Santoni 1116e14c49 load-fragment: Resolve specifiers in DeviceAllow (#3019)
Closes #1602
2016-04-12 18:00:19 +02:00
Lennart Poettering 463d0d1569 core: remove ManagerRunningAs enum
Previously, we had two enums ManagerRunningAs and UnitFileScope, that were
mostly identical and converted from one to the other all the time. The latter
had one more value UNIT_FILE_GLOBAL however.

Let's simplify things, and remove ManagerRunningAs and replace it by
UnitFileScope everywhere, thus making the translation unnecessary. Introduce
two new macros MANAGER_IS_SYSTEM() and MANAGER_IS_USER() to simplify checking
if we are running in one or the user context.
2016-04-12 13:43:30 +02:00
Lennart Poettering a3c4eb0710 core: rework generator dir logic, move the dirs into LookupPaths structure
A long time ago – when generators where first introduced – the directories for
them were randomly created via mkdtemp(). This was changed later so that they
use fixed name directories now. Let's make use of this, and add the genrator
dirs to the LookupPaths structure and into the unit file search path maintained
in it. This has the benefit that the generator dirs are now normal part of the
search path for all tools, and thus are shown in "systemctl list-unit-files"
too.
2016-04-12 13:43:29 +02:00
Zbigniew Jędrzejewski-Szmek 6d10d308c6 Do not report masked units as changed (#2921)
* core/unit: extract checking of stat paths into helper function

The same code was repeated three times.

* core: treat masked files as "unchanged"

systemctl prints the "unit file changed on disk" warning
for a masked unit. I think it's better to print nothing in that
case.

When a masked unit is loaded, set mtime as 0. When checking
if a unit with mtime of 0 needs reload, check that the mask
is still in place.

* test-dnssec: fix build without gcrypt

Also reorder the test functions to follow the way they are called
from main().
2016-04-12 11:10:57 +02:00
Zbigniew Jędrzejewski-Szmek 3a8db9fe81 core: treat masked files as "unchanged"
systemctl prints the "unit file changed on disk" warning
for a masked unit. I think it's better to print nothing in that
case.

When a masked unit is loaded, set mtime as 0. When checking
if a unit with mtime of 0 needs reload, check that the mask
is still in place.
2016-03-31 00:38:50 -04:00
Michal Sekletar 7aad67e7f2 core: look for instance when processing template name
If first attempt to merge units failed and we are trying to do
merge the other way around and at the same time we are working with
template name, then other unit can't possibly be template, because it is
not possible to have template unit running, only instances of the
template. Thus we need to look for already active instance instead.
2016-03-16 15:40:14 +01:00
Vito Caputo 9ed794a32d tree-wide: minor formatting inconsistency cleanups 2016-02-23 14:20:34 -08:00
Vito Caputo 313cefa1d9 tree-wide: make ++/-- usage consistent WRT spacing
Throughout the tree there's spurious use of spaces separating ++ and --
operators from their respective operands.  Make ++ and -- operator
consistent with the majority of existing uses; discard the spaces.
2016-02-22 20:32:04 -08:00