Commit Graph

667 Commits

Author SHA1 Message Date
Lennart Poettering 065b47749d tree-wide: use ERRNO_IS_PRIVILEGE() whereever appropriate 2020-09-22 16:25:22 +02:00
Topi Miettinen 9df2cdd8ec exec: SystemCallLog= directive
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.

---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
2020-09-15 12:54:17 +03:00
Topi Miettinen 005bfaf118 exec: Add kill action to system call filters
Define explicit action "kill" for SystemCallErrorNumber=.

In addition to errno code, allow specifying "kill" as action for
SystemCallFilter=.

---
v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP
v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes,
 init syscall_errno
v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit
parsing without seccomp
v4: fix build without seccomp
v3: drop log action
v2: action -> number
2020-09-15 12:54:17 +03:00
Frantisek Sumsal 69e3234db7 tree-wide: fix typos found by codespell
Reported by Fossies.org
2020-09-14 15:32:37 +02:00
Tobias Kaufmann 198dc17845 core: fix set keep caps for ambient capabilities
The securebit keep-caps retains the capabilities in the permitted set
over an UID change (ambient capabilities are cleared though).

Setting the keep-caps securebit after the uid change and before execve
doesn't make sense as it is cleared during execve and there is no
additional user ID change after this point.

Altough the documentation (man 7 capabilities) is ambigious, keep-caps
is reset during execve although keep-caps-locked is set. After execve
only keep-caps-locked is set and keep-caps is cleared.
2020-09-09 11:17:42 +02:00
Tobias Kaufmann 16fcb1918a core: fix comments on ambient capabilities
The comments on the code for ambient capabilities was wrong/outdated.
2020-09-09 11:17:42 +02:00
Lennart Poettering f3f4abad29
Merge pull request #16979 from keszybz/return-log-debug
Fix 'return log_error()' and 'return log_warning()' patterns
2020-09-08 19:54:38 +02:00
Lennart Poettering c6552f7cd5
Merge pull request #16955 from keszybz/test-execute-cleanup
One patch for test-execute and assorted cleanups
2020-09-08 18:33:12 +02:00
Zbigniew Jędrzejewski-Szmek c413bb28df tree-wide: correct cases where return log_{error,warning} is used without value
In various cases, we would say 'return log_warning()' or 'return log_error()'. Those
functions return 0 if no error is passed in. For log_warning or log_error this doesn't
make sense, and we generally want to propagate the error. In the few cases where
the error should be ignored, I think it's better to split it in two, and call 'return 0'
on a separate line.
2020-09-08 17:40:46 +02:00
Zbigniew Jędrzejewski-Szmek 90e74a66e6 tree-wide: define iterator inside of the macro 2020-09-08 12:14:05 +02:00
Zbigniew Jędrzejewski-Szmek 5b10116e49 core/{execute, manager}: reduce scope of iterator variables a bit 2020-09-04 18:14:26 +02:00
Lennart Poettering 7cc60ea414
Merge pull request #16821 from cgzones/selinux_status
selinux: use SELinux status page
2020-09-03 14:55:08 +02:00
Tobias Kaufmann dbdc4098f6 core: fix securebits setting
Desired functionality:
Set securebits for services started as non-root user.

Failure:
The starting of the service fails if no ambient capability shall be
raised.
... systemd[217941]: ...: Failed to set process secure bits: Operation
not permitted
... systemd[217941]: ...: Failed at step SECUREBITS spawning
/usr/bin/abc.service: Operation not permitted
... systemd[1]: abc.service: Failed with result 'exit-code'.

Reason:
For setting securebits the capability CAP_SETPCAP is required. However
the securebits (if no ambient capability shall be raised) are set after
setresuid.
When setresuid is invoked all capabilities are dropped from the
permitted, effective and ambient capability set. If the securebit
SECBIT_KEEP_CAPS is set the permitted capability set is retained, but
the effective and the ambient set are cleared.

If ambient capabilities shall be set, the securebit SECBIT_KEEP_CAPS is
added to the securebits configured in the service file and set together
with the securebits from the service file before setresuid is executed
(in enforce_user).
Before setresuid is executed the capabilities are the same as for pid1.
This means that all capabilities in the effective, permitted and
bounding set are set. Thus the capability CAP_SETPCAP is in the
effective set and the prctl(PR_SET_SECUREBITS, ...) succeeds.
However, if the secure bits aren't set before setresuid is invoked they
shall be set shortly after the uid change in enforce_user.
This fails as SECBIT_KEEP_CAPS wasn't set before setresuid and in
consequence the effective and permitted set was cleared, hence
CAP_SETPCAP is not set in the effective set (and cannot be raised any
longer) and prctl(PR_SET_SECUREBITS, ...) failes with EPERM.

Proposed solution:
The proposed solution consists of three parts
1. Check in enforce_user, if securebits are configured in the service
   file. If securebits are configured, set SECBIT_KEEP_CAPS
   before invoking setresuid.
2. Don't set any other securebits than SECBIT_KEEP_CAPS in enforce_user,
   but set all requested ones after enforce_user.
   This has the advantage that securebits are set at the same place for
   root and non-root services.
3. Raise CAP_SETPCAP to the effective set (if not already set) before
   setting the securebits to avoid EPERM during the prctl syscall.

For gaining CAP_SETPCAP the function capability_bounding_set_drop is
splitted into two functions:
- The first one raises CAP_SETPCAP (required for dropping bounding
  capabilities)
- The second drops the bounding capabilities

Why are ambient capabilities not affected by this change?
Ambient capabilities get cleared during setresuid, no matter if
SECBIT_KEEP_CAPS is set or not.
For raising ambient capabilities for a user different to root, the
requested capability has to be raised in the inheritable set first. Then
the SECBIT_KEEP_CAPS securebit needs to be set before setresuid is
invoked. Afterwards the ambient capability can be raised, because it is
in the inheritable and permitted set.

Security considerations:
Although the manpage is ambiguous SECBIT_KEEP_CAPS is cleared during
execve no matter if SECBIT_KEEP_CAPS_LOCKED is set or not. If both are
set only SECBIT_KEEP_CAPS_LOCKED is set after execve.
Setting SECBIT_KEEP_CAPS in enforce_user for being able to set
securebits is no security risk, as the effective and permitted set are
set to the value of the ambient set during execve (if the executed file
has no file capabilities. For details check man 7 capabilities).

Remark:
In capability-util.c is a comment complaining about the missing
capability CAP_SETPCAP in the effective set, after the kernel executed
/sbin/init. Thus it is checked there if this capability has to be raised
in the effective set before dropping capabilities from the bounding set.
If this were true all the time, ambient capabilities couldn't be set
without dropping at least one capability from the bounding set, as the
capability CAP_SETPCAP would miss and setting SECBIT_KEEP_CAPS would
fail with EPERM.
2020-09-01 10:53:26 +02:00
Lennart Poettering b519529104
Merge pull request #16841 from keszybz/acl-util-bitmask
Use a bitmask in fd_add_uid_acl_permission()
2020-08-31 16:45:13 +02:00
Yu Watanabe 8062e643e6 core: clear bind mounts on error
Follow-up for bbb4e7f39f.

Fixes CID#1431998.
2020-08-27 18:20:34 +09:00
Christian Göttsche 2df2152c20 selinux: fork label-aware children with up-to-date label database
The parent process may not perform any label operation, so the
database might not get updated on a SELinux policy change on its own.

Reload the label database once on a policy change, instead of n times
in every started child.
2020-08-27 10:28:53 +02:00
Zbigniew Jędrzejewski-Szmek 567aeb5801 shared/acl-util: convert rd,wr,ex to a bitmask
I find this version much more readable.

Add replacement defines so that when acl/libacl.h is not available, the
ACL_{READ,WRITE,EXECUTE} constants are also defined. Those constants were
declared in the kernel headers already in 1da177e4c3f41524e886b7f1b8a0c1f,
so they should be the same pretty much everywhere.
2020-08-27 10:20:12 +02:00
PhoenixDiscord e8607daf7d
Replace gendered pronouns with gender neutral ones. (#16844) 2020-08-27 11:52:48 +09:00
Lennart Poettering bbb4e7f39f core: hide /run/credentials whenever namespacing is requested
Ideally we would like to hide all other service's credentials for all
services. That would imply for us to enable mount namespacing for all
services, which is something we cannot do, both due to compatibility
with the status quo ante, and because a number of services legitimately
should be able to install mounts in the host hierarchy.

Hence we do the second best thing, we hide the credentials automatically
for all services that opt into mount namespacing otherwise. This is
quite different from other mount sandboxing options: usually you have to
explicitly opt into each. However, given that the credentials logic is a
brand new concept we invented right here and now, and particularly
security sensitive it's OK to reverse this, and by default hide
credentials whenever we can (i.e. whenever mount namespacing is
otherwise opt-ed in to).

Long story short: if you want to hide other service's credentials, the
most basic options is to just turn on PrivateMounts= and there you go,
they should all be gone.
2020-08-25 19:45:38 +02:00
Lennart Poettering bb0c0d6f29 core: add credentials logic
Fixes: #15778 #16060
2020-08-25 19:45:35 +02:00
Lennart Poettering 4e39995371 core: introduce ProtectProc= and ProcSubset= to expose hidepid= and subset= procfs mount options
Kernel 5.8 gained a hidepid= implementation that is truly per procfs,
which allows us to mount a distinct once into every unit, with
individual hidepid= settings. Let's expose this via two new settings:
ProtectProc= (wrapping hidpid=) and ProcSubset= (wrapping subset=).

Replaces: #11670
2020-08-24 20:11:02 +02:00
Lennart Poettering 52b3d6523f namespace: move protect_{home|system} into NamespaceInfo
it's not entirely clear what shall be passed via parameter and what via
struct, but these two definitely fit well with the other protect_xyz
fields, hence let's move them over.

We probably should move a lot more more fields into the structure
actuall (most? all even?).
2020-08-24 20:10:30 +02:00
Luca Boccassi 427353f668 core: add mount options support for MountImages
Follow the same model established for RootImage and RootImageOptions,
and allow to either append a single list of options or tuples of
partition_number:options.
2020-08-20 14:45:40 +01:00
Luca Boccassi 9ece644435 core: change RootImageOptions to use names instead of partition numbers
Follow the designations from the Discoverable Partitions Specification
2020-08-20 13:58:02 +01:00
Luca Boccassi b3d133148e core: new feature MountImages
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.

Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
2020-08-05 21:34:55 +01:00
Luca Boccassi 18d7370587 service: add new RootImageOptions feature
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:

RootImageOptions=1:ro,dev 2:nosuid nodev

In absence of a partition number, 0 is assumed.
2020-07-29 17:17:32 +01:00
Lennart Poettering c3f8a065e9 execute: take ownership of more fields in ExecParameters
Let's simplify things a bit, and take ownership of more fields in
ExecParameters, so that they are automatically freed when the structure
is released.
2020-07-23 08:37:21 +02:00
Lennart Poettering f63ef93703 execute: fix if check
Fixes: coverity 1430459
2020-07-16 08:35:18 +02:00
Lennart Poettering 8d5bb13d78 core: fix invalid assertion
We miscounted here, and would hit an assert once too early.
2020-07-16 09:13:04 +09:00
Zbigniew Jędrzejewski-Szmek 56a13a495c pid1: create ro private tmp dirs when /tmp or /var/tmp is read-only
Read-only /var/tmp is more likely, because it's backed by a real device. /tmp
is (by default) backed by tmpfs, but it doesn't have to be. In both cases the
same consideration applies.

If we boot with read-only /var/tmp, any unit with PrivateTmp=yes would fail
because we cannot create the subdir under /var/tmp to mount the private directory.
But many services actually don't require /var/tmp (either because they only use
it occasionally, or because they only use /tmp, or even because they don't use the
temporary directories at all, and PrivateTmp=yes is used to isolate them from
the rest of the system).

To handle both cases let's create a read-only directory under /run/systemd and
mount it as the private /tmp or /var/tmp. (Read-only to not fool the service into
dumping too much data in /run.)

$ sudo systemd-run -t -p PrivateTmp=yes bash
Running as unit: run-u14.service
Press ^] three times within 1s to disconnect TTY.
[root@workstation /]# ls -l /tmp/
total 0
[root@workstation /]# ls -l /var/tmp/
total 0
[root@workstation /]# touch /tmp/f
[root@workstation /]# touch /var/tmp/f
touch: cannot touch '/var/tmp/f': Read-only file system

This commit has more changes than I like to put in one commit, but it's touching all
the same paths so it's hard to split.
exec_runtime_make() was using the wrong cleanup function, so the directory would be
left behind on error.
2020-07-14 19:47:15 +02:00
Zbigniew Jędrzejewski-Szmek 37b22b3b47 tree: wide "the the" and other trivial grammar fixes 2020-07-02 09:51:38 +02:00
Luca Boccassi d4d55b0d13 core: add RootHashSignature service parameter
Allow to explicitly pass root hash signature as a unit option. Takes precedence
over implicit checks.
2020-06-25 08:45:21 +01:00
Lennart Poettering 6b000af4f2 tree-wide: avoid some loaded terms
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/

This gets rid of most but not occasions of these loaded terms:

1. scsi_id and friends are something that is supposed to be removed from
   our tree (see #7594)

2. The test suite defines an API used by the ubuntu CI. We can remove
   this too later, but this needs to be done in sync with the ubuntu CI.

3. In some cases the terms are part of APIs we call or where we expose
   concepts the kernel names the way it names them. (In particular all
   remaining uses of the word "slave" in our codebase are like this,
   it's used by the POSIX PTY layer, by the network subsystem, the mount
   API and the block device subsystem). Getting rid of the term in these
   contexts would mean doing some major fixes of the kernel ABI first.

Regarding the replacements: when whitelist/blacklist is used as noun we
replace with with allow list/deny list, and when used as verb with
allow-list/deny-list.
2020-06-25 09:00:19 +02:00
Luca Boccassi 0389f4fa81 core: add RootHash and RootVerity service parameters
Allow to explicitly pass root hash (explicitly or as a file) and verity
device/file as unit options. Take precedence over implicit checks.
2020-06-23 10:50:09 +02:00
Lennart Poettering f3dc6af20f core: automatically update StandardOuput=syslog to =journal (and similar for StandardError=)
Let's go one step further and upgrade implicitly. Usually =syslog
assignments are historic artifacts only. Let's upgrade the lines
automatically, and politely suggest people update their unit
files/configuration (and drop the lines altogether, without
replacement).

Fixes: #15807
2020-05-15 00:05:46 +02:00
Zbigniew Jędrzejewski-Szmek 8f3e342fa9 core: fix unused variable warning when !HAVE_SECCOMP 2020-04-23 14:42:09 +02:00
Lennart Poettering e8cf09b2a2 core: make sure we don't get confused when setting TERM for a tty fd
Fixes: #15344
2020-04-22 22:59:41 +02:00
Frantisek Sumsal 86b52a3958 tree-wide: fix spelling errors
Based on a report from Fossies.org using Codespell.

Followup to #15436
2020-04-21 23:21:08 +02:00
Lennart Poettering daf8f72b4e core: make sure ProtectHostname= is handled gracefully in containers lacking seccomp
Fixes: #15408
2020-04-13 17:32:27 +02:00
Zbigniew Jędrzejewski-Szmek ad21e542b2 manager: add CoredumpFilter= setting
Fixes #6685.
2020-04-09 14:08:48 +02:00
Michal Sekletár e2b2fb7f56 core: add support for setting CPUAffinity= to special "numa" value
systemd will automatically derive CPU affinity mask from NUMA node
mask.

Fixes #13248
2020-03-16 08:57:28 +01:00
Topi Miettinen efa2f3a18b
execute: don't create /tmp and /var/tmp if both are inaccessible
If both /tmp and either /var/tmp or whole /var are inaccessible, there's no
need to create the temporary directories.
2020-03-10 16:51:29 +02:00
Zbigniew Jędrzejewski-Szmek 908055f61f
Merge pull request #15033 from yuwata/state-directory-migrate-issue
execute: Fix migration from DynamicUser=yes to no
2020-03-09 17:34:55 +01:00
Yu Watanabe 578dc69f2a execute: Fix migration from DynamicUser=yes to no
Closes #12131.
2020-03-06 21:02:26 +09:00
Zbigniew Jędrzejewski-Szmek 44e5d00603 pid1: remove unnecessary terminator
We specify the number of items as the first argument already.
2020-03-05 08:13:49 +01:00
Yu Watanabe dd0395b565 make namespace_flags_to_string() not return empty string
This improves the following debug log.

Before:
systemd[1162]: Restricting namespace to: .

After:
systemd[1162]: Restricting namespace to: n/a.
2020-03-03 21:17:38 +01:00
Zbigniew Jędrzejewski-Szmek 86fca584c3 core/execute: use return value from sockaddr_un_set_path(), remove duplicate check 2020-03-02 15:56:30 +01:00
Zbigniew Jędrzejewski-Szmek f36a9d5909 tree-wide: use the return value from sockaddr_un_set_path()
It fully initializes the address structure, so no need for pre-initialization,
and also returns the length of the address, so no need to recalculate using
SOCKADDR_UN_LEN().

socklen_t is unsigned, so let's not use an int for it. (It doesn't matter, but
seems cleaner and more portable to not assume anything about the type.)
2020-03-02 15:55:44 +01:00
Zbigniew Jędrzejewski-Szmek ee00d1e95e pid1: do not fail if we get EPERM while setting up network name
In a user namespace container:
Feb 28 12:45:53 0b2420135953 systemd[1]: Starting Home Manager...
Feb 28 12:45:53 0b2420135953 systemd[21]: systemd-homed.service: Failed to set up network namespacing: Operation not permitted
Feb 28 12:45:53 0b2420135953 systemd[21]: systemd-homed.service: Failed at step NETWORK spawning /usr/lib/systemd/systemd-homed: Operation not permitted
Feb 28 12:45:53 0b2420135953 systemd[1]: systemd-homed.service: Main process exited, code=exited, status=225/NETWORK
Feb 28 12:45:53 0b2420135953 systemd[1]: systemd-homed.service: Failed with result 'exit-code'.
Feb 28 12:45:53 0b2420135953 systemd[1]: Failed to start Home Manager.

We should treat this similarly to the case where network namespace are not
supported at all.

https://bugzilla.redhat.com/show_bug.cgi?id=1807465
2020-02-29 19:33:19 +09:00
Nate Jones ecf63c9102 execute: Make '+' exec prefix ignore PrivateTmp=yes
The man pages state that the '+' prefix in Exec* directives should
ignore filesystem namespacing options such as PrivateTmp. Now it does.

This is very similar to #8842, just with PrivateTmp instead of
PrivateDevices.
2020-02-29 19:32:01 +09:00