Commit Graph

692 Commits

Author SHA1 Message Date
Félix Baylac-Jacqué 15d1e0badf
wip 2021-01-10 22:45:49 +01:00
Lennart Poettering 6dd16814a5
Merge pull request #17079 from keszybz/late-exec-resolution
Resolve executable paths before execution, use fexecve()
2020-12-03 14:58:20 +01:00
Lennart Poettering 986311c2da fileio: teach read_full_file_full() to read from offset/with maximum size 2020-12-01 14:17:47 +01:00
Yu Watanabe d85ff94477 core: use SYNTHETIC_ERRNO() macro 2020-11-27 14:35:20 +09:00
Yu Watanabe db9ecf0501 license: LGPL-2.1+ -> LGPL-2.1-or-later 2020-11-09 13:23:58 +09:00
Zbigniew Jędrzejewski-Szmek a6d9111c67 core/execute: fall back to execve() for scripts
fexecve() fails with ENOENT and we need a fallback. Add appropriate test.
2020-11-06 15:14:13 +01:00
Zbigniew Jędrzejewski-Szmek b83d505087 core: use fexecve() to spawn children
We base the smack/selinux setup on the executable. Let's open the file
once and use the same fd for that setup and the subsequent execve.
2020-11-06 15:13:01 +01:00
Zbigniew Jędrzejewski-Szmek 5ca9139ace basic/path-util: let find_executable_full() optionally return an fd 2020-11-06 15:12:54 +01:00
Lennart Poettering d3dcf4e3b9 fileio: beef up READ_FULL_FILE_CONNECT_SOCKET to allow setting sender socket name
This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of
read_full_file_full() a bit: when used a sender socket name may be
specified. If specified as NULL behaviour is as before: the client
socket name is picked by the kernel. But if specified as non-NULL the
client can pick a socket name to use when connecting. This is useful to
communicate a minimal amount of metainformation from client to server,
outside of the transport payload.

Specifically, these beefs up the service credential logic to pass an
abstract AF_UNIX socket name as client socket name when connecting via
READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name
and the eventual credential name. This allows servers implementing the
trivial credential socket logic to distinguish clients: via a simple
getpeername() it can be determined which unit is requesting a
credential, and which credential specifically.

Example: with this patch in place, in a unit file "waldo.service" a
configuration line like the following:

    LoadCredential=foo:/run/quux/creds.sock

will result in a connection to the AF_UNIX socket /run/quux/creds.sock,
originating from an abstract namespace AF_UNIX socket:

    @$RANDOM/unit/waldo.service/foo

(The $RANDOM is replaced by some randomized string. This is included in
the socket name order to avoid namespace squatting issues: the abstract
socket namespace is open to unprivileged users after all, and care needs
to be taken not to use guessable names)

The services listening on the /run/quux/creds.sock socket may thus
easily retrieve the name of the unit the credential is requested for
plus the credential name, via a simpler getpeername(), discarding the
random preifx and the /unit/ string.

This logic uses "/" as separator between the fields, since both unit
names and credential names appear in the file system, and thus are
designed to use "/" as outer separators. Given that it's a good safe
choice to use as separators here, too avoid any conflicts.

This is a minimal patch only: the new logic is used only for the unit
file credential logic. For other places where we use
READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this
scheme too, but this should be done carefully in later patches, since
the socket names become API that way, and we should determine the right
amount of info to pass over.
2020-11-03 09:48:04 +01:00
Zbigniew Jędrzejewski-Szmek 1da37e58ff core/execute: refactor creation of array with fds to keep during execution
We close fds in two phases, first some and then the some more. When passing
a list of fds to exclude from closing to the closing function, we would
pass some in an array and the rest as separate arguments. For the fds which
should be excluded in both closing phases, let's always create the array
and put the relevant fds there. This has the advantage that if more fds to
exclude in both phases are added later, we don't need to add more positional
arguments.

The list passed to setup_pam() is not changed. I think we could pass more fds
to close there, but I'm leaving that unchanged.

The setting of FD_CLOEXEC on an already open fds is dropped. The fd is opened
in service_allocate_exec_fd() and there is no reason to suspect that it might
have been opened incorrectly. If some rogue code is unsetting our FD_CLOEXEC
bits, then it might flip any fd, no reason to single this one out.
2020-10-14 18:29:25 +02:00
Lennart Poettering 74aaf59b1a execute: make sure some more functions follow coding style
Initialize all return values on success, as our usual coding style
suggests.
2020-10-14 16:41:37 +02:00
Lennart Poettering f5fa352f1e execute: fix single character typo
Corrects: c413bb28df

Fixes: #17313
2020-10-14 16:41:37 +02:00
Frantisek Sumsal d46b79bbe0 tree-wide: drop if braces around single line expressions as well 2020-10-09 15:11:55 +02:00
Lennart Poettering 14eb3285ab execute: use empty_to_root() a bit more 2020-10-01 11:02:11 +02:00
Lennart Poettering 74e1252072 execute: add helper for checking if root_directory/root_image are set in ExecContext 2020-10-01 11:02:11 +02:00
Lennart Poettering 36296ae2ad
Merge pull request #17152 from keszybz/make-mountapivfs-default
Make MountAPIVFS=yes default
2020-10-01 11:00:02 +02:00
Zbigniew Jędrzejewski-Szmek 48904c8bfd core/execute: escape the separator in exported paths
Our paths shouldn't even contain ":", but let's escape it if one somehow sneaks
in.
2020-09-25 13:36:34 +02:00
Zbigniew Jędrzejewski-Szmek d4d9f034b1 basic/strv: allow escaping the separator in strv_join()
The new parameter is false everywhere except for tests, so no functional change
is expected.
2020-09-25 13:36:34 +02:00
Zbigniew Jędrzejewski-Szmek 6119878480 core: turn on MountAPIVFS=true when RootImage or RootDirectory are specified
Lennart wanted to do this back in
01c33c1eff.
For better or worse, this wasn't done because I thought that turning on MountAPIVFS
is a compat break for RootDirectory and people might be negatively surprised by it.
Without this, search for binaries doesn't work (access_fd() requires /proc).
Let's turn it on, but still allow overriding to "no".

When RootDirectory=/, MountAPIVFS=1 doesn't work. This might be a buglet on its
own, but this patch doesn't change the situation.
2020-09-24 10:03:18 +02:00
Zbigniew Jędrzejewski-Szmek 5e98086d16 core: remember when we set ExecContext.mount_apivfs
No functional change intended so far.
2020-09-24 10:03:18 +02:00
Lennart Poettering bcaf20dc38
Merge pull request #17143 from keszybz/late-exec-resolution-alt
Late exec resolution (subset)
2020-09-24 09:38:36 +02:00
Lennart Poettering 21935150a0 tree-wide: switch remaining mount() invocations over to mount_nofollow_verbose()
(Well, at least the ones where that makes sense. Where it does't make
sense are the ones that re invoked on the root path, which cannot
possibly be a symlink.)
2020-09-23 18:57:37 +02:00
Lennart Poettering 065b47749d tree-wide: use ERRNO_IS_PRIVILEGE() whereever appropriate 2020-09-22 16:25:22 +02:00
Zbigniew Jędrzejewski-Szmek 0af07108e4 core/execute: reduce indentation level a bit 2020-09-18 15:28:48 +02:00
Zbigniew Jędrzejewski-Szmek 9f71ba8d95 core: resolve binary names immediately before execution
This has two advantages:
- we save a bit of IO in early boot because we don't look for executables
  which we might never call
- if the executable is in a different place and it was specified as a
  non-absolute path, it is OK if it moves to a different place. This should
  solve the case paths are different in the initramfs.

Since the executable path is only available quite late, the call to
mac_selinux_get_child_mls_label() which uses the path needs to be moved down
too.

Fixes #16076.
2020-09-18 15:28:48 +02:00
Zbigniew Jędrzejewski-Szmek 0706c01259 Add CLOSE_AND_REPLACE helper
Similar to free_and_replace. I think this should be uppercase to make it
clear that this is a macro. free_and_replace should probably be uppercased
too.
2020-09-18 15:28:48 +02:00
Topi Miettinen 9df2cdd8ec exec: SystemCallLog= directive
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.

---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
2020-09-15 12:54:17 +03:00
Topi Miettinen 005bfaf118 exec: Add kill action to system call filters
Define explicit action "kill" for SystemCallErrorNumber=.

In addition to errno code, allow specifying "kill" as action for
SystemCallFilter=.

---
v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP
v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes,
 init syscall_errno
v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit
parsing without seccomp
v4: fix build without seccomp
v3: drop log action
v2: action -> number
2020-09-15 12:54:17 +03:00
Frantisek Sumsal 69e3234db7 tree-wide: fix typos found by codespell
Reported by Fossies.org
2020-09-14 15:32:37 +02:00
Tobias Kaufmann 198dc17845 core: fix set keep caps for ambient capabilities
The securebit keep-caps retains the capabilities in the permitted set
over an UID change (ambient capabilities are cleared though).

Setting the keep-caps securebit after the uid change and before execve
doesn't make sense as it is cleared during execve and there is no
additional user ID change after this point.

Altough the documentation (man 7 capabilities) is ambigious, keep-caps
is reset during execve although keep-caps-locked is set. After execve
only keep-caps-locked is set and keep-caps is cleared.
2020-09-09 11:17:42 +02:00
Tobias Kaufmann 16fcb1918a core: fix comments on ambient capabilities
The comments on the code for ambient capabilities was wrong/outdated.
2020-09-09 11:17:42 +02:00
Lennart Poettering f3f4abad29
Merge pull request #16979 from keszybz/return-log-debug
Fix 'return log_error()' and 'return log_warning()' patterns
2020-09-08 19:54:38 +02:00
Lennart Poettering c6552f7cd5
Merge pull request #16955 from keszybz/test-execute-cleanup
One patch for test-execute and assorted cleanups
2020-09-08 18:33:12 +02:00
Zbigniew Jędrzejewski-Szmek c413bb28df tree-wide: correct cases where return log_{error,warning} is used without value
In various cases, we would say 'return log_warning()' or 'return log_error()'. Those
functions return 0 if no error is passed in. For log_warning or log_error this doesn't
make sense, and we generally want to propagate the error. In the few cases where
the error should be ignored, I think it's better to split it in two, and call 'return 0'
on a separate line.
2020-09-08 17:40:46 +02:00
Zbigniew Jędrzejewski-Szmek 90e74a66e6 tree-wide: define iterator inside of the macro 2020-09-08 12:14:05 +02:00
Zbigniew Jędrzejewski-Szmek 5b10116e49 core/{execute, manager}: reduce scope of iterator variables a bit 2020-09-04 18:14:26 +02:00
Lennart Poettering 7cc60ea414
Merge pull request #16821 from cgzones/selinux_status
selinux: use SELinux status page
2020-09-03 14:55:08 +02:00
Tobias Kaufmann dbdc4098f6 core: fix securebits setting
Desired functionality:
Set securebits for services started as non-root user.

Failure:
The starting of the service fails if no ambient capability shall be
raised.
... systemd[217941]: ...: Failed to set process secure bits: Operation
not permitted
... systemd[217941]: ...: Failed at step SECUREBITS spawning
/usr/bin/abc.service: Operation not permitted
... systemd[1]: abc.service: Failed with result 'exit-code'.

Reason:
For setting securebits the capability CAP_SETPCAP is required. However
the securebits (if no ambient capability shall be raised) are set after
setresuid.
When setresuid is invoked all capabilities are dropped from the
permitted, effective and ambient capability set. If the securebit
SECBIT_KEEP_CAPS is set the permitted capability set is retained, but
the effective and the ambient set are cleared.

If ambient capabilities shall be set, the securebit SECBIT_KEEP_CAPS is
added to the securebits configured in the service file and set together
with the securebits from the service file before setresuid is executed
(in enforce_user).
Before setresuid is executed the capabilities are the same as for pid1.
This means that all capabilities in the effective, permitted and
bounding set are set. Thus the capability CAP_SETPCAP is in the
effective set and the prctl(PR_SET_SECUREBITS, ...) succeeds.
However, if the secure bits aren't set before setresuid is invoked they
shall be set shortly after the uid change in enforce_user.
This fails as SECBIT_KEEP_CAPS wasn't set before setresuid and in
consequence the effective and permitted set was cleared, hence
CAP_SETPCAP is not set in the effective set (and cannot be raised any
longer) and prctl(PR_SET_SECUREBITS, ...) failes with EPERM.

Proposed solution:
The proposed solution consists of three parts
1. Check in enforce_user, if securebits are configured in the service
   file. If securebits are configured, set SECBIT_KEEP_CAPS
   before invoking setresuid.
2. Don't set any other securebits than SECBIT_KEEP_CAPS in enforce_user,
   but set all requested ones after enforce_user.
   This has the advantage that securebits are set at the same place for
   root and non-root services.
3. Raise CAP_SETPCAP to the effective set (if not already set) before
   setting the securebits to avoid EPERM during the prctl syscall.

For gaining CAP_SETPCAP the function capability_bounding_set_drop is
splitted into two functions:
- The first one raises CAP_SETPCAP (required for dropping bounding
  capabilities)
- The second drops the bounding capabilities

Why are ambient capabilities not affected by this change?
Ambient capabilities get cleared during setresuid, no matter if
SECBIT_KEEP_CAPS is set or not.
For raising ambient capabilities for a user different to root, the
requested capability has to be raised in the inheritable set first. Then
the SECBIT_KEEP_CAPS securebit needs to be set before setresuid is
invoked. Afterwards the ambient capability can be raised, because it is
in the inheritable and permitted set.

Security considerations:
Although the manpage is ambiguous SECBIT_KEEP_CAPS is cleared during
execve no matter if SECBIT_KEEP_CAPS_LOCKED is set or not. If both are
set only SECBIT_KEEP_CAPS_LOCKED is set after execve.
Setting SECBIT_KEEP_CAPS in enforce_user for being able to set
securebits is no security risk, as the effective and permitted set are
set to the value of the ambient set during execve (if the executed file
has no file capabilities. For details check man 7 capabilities).

Remark:
In capability-util.c is a comment complaining about the missing
capability CAP_SETPCAP in the effective set, after the kernel executed
/sbin/init. Thus it is checked there if this capability has to be raised
in the effective set before dropping capabilities from the bounding set.
If this were true all the time, ambient capabilities couldn't be set
without dropping at least one capability from the bounding set, as the
capability CAP_SETPCAP would miss and setting SECBIT_KEEP_CAPS would
fail with EPERM.
2020-09-01 10:53:26 +02:00
Lennart Poettering b519529104
Merge pull request #16841 from keszybz/acl-util-bitmask
Use a bitmask in fd_add_uid_acl_permission()
2020-08-31 16:45:13 +02:00
Yu Watanabe 8062e643e6 core: clear bind mounts on error
Follow-up for bbb4e7f39f.

Fixes CID#1431998.
2020-08-27 18:20:34 +09:00
Christian Göttsche 2df2152c20 selinux: fork label-aware children with up-to-date label database
The parent process may not perform any label operation, so the
database might not get updated on a SELinux policy change on its own.

Reload the label database once on a policy change, instead of n times
in every started child.
2020-08-27 10:28:53 +02:00
Zbigniew Jędrzejewski-Szmek 567aeb5801 shared/acl-util: convert rd,wr,ex to a bitmask
I find this version much more readable.

Add replacement defines so that when acl/libacl.h is not available, the
ACL_{READ,WRITE,EXECUTE} constants are also defined. Those constants were
declared in the kernel headers already in 1da177e4c3f41524e886b7f1b8a0c1f,
so they should be the same pretty much everywhere.
2020-08-27 10:20:12 +02:00
PhoenixDiscord e8607daf7d
Replace gendered pronouns with gender neutral ones. (#16844) 2020-08-27 11:52:48 +09:00
Lennart Poettering bbb4e7f39f core: hide /run/credentials whenever namespacing is requested
Ideally we would like to hide all other service's credentials for all
services. That would imply for us to enable mount namespacing for all
services, which is something we cannot do, both due to compatibility
with the status quo ante, and because a number of services legitimately
should be able to install mounts in the host hierarchy.

Hence we do the second best thing, we hide the credentials automatically
for all services that opt into mount namespacing otherwise. This is
quite different from other mount sandboxing options: usually you have to
explicitly opt into each. However, given that the credentials logic is a
brand new concept we invented right here and now, and particularly
security sensitive it's OK to reverse this, and by default hide
credentials whenever we can (i.e. whenever mount namespacing is
otherwise opt-ed in to).

Long story short: if you want to hide other service's credentials, the
most basic options is to just turn on PrivateMounts= and there you go,
they should all be gone.
2020-08-25 19:45:38 +02:00
Lennart Poettering bb0c0d6f29 core: add credentials logic
Fixes: #15778 #16060
2020-08-25 19:45:35 +02:00
Lennart Poettering 4e39995371 core: introduce ProtectProc= and ProcSubset= to expose hidepid= and subset= procfs mount options
Kernel 5.8 gained a hidepid= implementation that is truly per procfs,
which allows us to mount a distinct once into every unit, with
individual hidepid= settings. Let's expose this via two new settings:
ProtectProc= (wrapping hidpid=) and ProcSubset= (wrapping subset=).

Replaces: #11670
2020-08-24 20:11:02 +02:00
Lennart Poettering 52b3d6523f namespace: move protect_{home|system} into NamespaceInfo
it's not entirely clear what shall be passed via parameter and what via
struct, but these two definitely fit well with the other protect_xyz
fields, hence let's move them over.

We probably should move a lot more more fields into the structure
actuall (most? all even?).
2020-08-24 20:10:30 +02:00
Luca Boccassi 427353f668 core: add mount options support for MountImages
Follow the same model established for RootImage and RootImageOptions,
and allow to either append a single list of options or tuples of
partition_number:options.
2020-08-20 14:45:40 +01:00
Luca Boccassi 9ece644435 core: change RootImageOptions to use names instead of partition numbers
Follow the designations from the Discoverable Partitions Specification
2020-08-20 13:58:02 +01:00
Luca Boccassi b3d133148e core: new feature MountImages
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.

Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
2020-08-05 21:34:55 +01:00