
4234 Commits

Author SHA1 Message Date
Yu Watanabe 0de4876496 core/socket: fix memleak in the error paths in usbffs_dispatch_eps() 2018-09-03 14:25:08 +09:00
Yu Watanabe 0c09cb0e78
Merge pull request #9977 from sourcejedi/no-remount-superblock3
Namespace fixes
2018-09-01 23:18:01 +09:00
Alan Jenkins fcac12d150 namespace: remove redundant .has_prefix=false
The MountEntry's added for EMPTY_DIR work very similarly to the TMPFS ones.
In both cases, .has_prefix is false.  In fact, .has_prefix is false in
*all* the MountEntry's we add except for the access mounts (READONLY etc).

But EMPTY_DIR stuck out by explicitly setting .has_prefix = false.
Let's remove that.
2018-09-01 17:23:01 +09:00
Alan Jenkins 4a756839e6 namespace: we always use a root_directory now
We changed to always set up the new namespace in a separate directory
(commit 0722b35)
2018-09-01 17:23:01 +09:00
Alan Jenkins ad8e66dcc4 namespace: fix mode for TemporaryFileSystem=
... when no mount options are passed.

Change the code, to avoid the following failure in the newly added tests:

exec-temporaryfilesystem-rw.service: Executing: /usr/bin/sh -x -c
'[ "$(stat -c %a /var)" == 755 ]'
++ stat -c %a /var
+ '[' 1777 == 755 ']'
Received SIGCHLD from PID 30364 (sh).
Child 30364 (sh) died (code=exited, status=1/FAILURE)

(And I spotted an opportunity to use TAKE_PTR() at the end).
2018-09-01 17:22:14 +09:00
Alan Jenkins 69338c3dfb namespace: don't try to remount superblocks
We can't remount the underlying superblocks, if we are inside a user
namespace and running Linux <= 4.17.  We can only change the per-mount
flags (MS_REMOUNT | MS_BIND).

This type of mount() call can only change the per-mount flags, so we
don't have to worry about passing the right string options now.

Fixes #9914 ("Since 1beab8b was merged, systemd has been failing to start
systemd-resolved inside unprivileged containers" ... "Failed to re-mount
'/run/systemd/unit-root/dev' read-only: Operation not permitted").

> It's basically my fault :-). I pointed out we could remount read-only
> without MS_BIND when reviewing the PR that added TemporaryFilesystem=,
> and poettering suggested to change PrivateDevices= at the same time.
> I think it's safe to change back, and I don't expect anyone will notice
> a difference in behaviour.
>
> It just surprised me to realize that
> `TemporaryFilesystem=/tmp:size=10M,ro,nosuid` would not apply `ro` to the
> superblock (underlying filesystem), like mount -osize=10M,ro,nosuid does.
> Maybe a comment could note the kernel version (v4.18), that lets you
> remount without MS_BIND inside a user namespace.

This makes the code longer and I guess this function is still ugly, sorry.
One obstacle to cleaning it up is the interaction between
`PrivateDevices=yes` and `ReadOnlyPaths=/dev`.  I've added a test for the
existing behaviour, which I think is now the correct behaviour.
2018-08-30 11:17:16 +01:00
Yu Watanabe 3343021755 dynamic-user: drop unnecessary initialization 2018-08-29 23:04:26 +09:00
Yu Watanabe 288ca7af8c dynamic-user: fix potential segfault 2018-08-28 05:09:00 +09:00
Yu Watanabe 8301aa0bf1 tree-wide: use DEFINE_TRIVIAL_REF_UNREF_FUNC() macro or friends where applicable 2018-08-27 14:01:46 +09:00
Yu Watanabe cf4b2f9906 tree-wide: use unsigned for refcount 2018-08-27 13:48:04 +09:00
Yu Watanabe 4366e598ae core: replace udev_device by sd_device 2018-08-23 04:57:39 +09:00
Yu Watanabe 6bcf00eda3 core/umount: replace udev_device by sd_device 2018-08-23 04:57:39 +09:00
Tejun Heo 6ae4283cb1 core: add IODeviceLatencyTargetSec
This adds support for the following proposed latency based IO control
mechanism.

  https://lkml.org/lkml/2018/6/5/428
2018-08-22 16:46:18 +02:00
Yu Watanabe 52e4d62550
Merge pull request #9852 from poettering/namespace-errno
namespace: be more careful when handling namespacing failures
2018-08-22 11:16:29 +09:00
Lennart Poettering 1beab8b0d0 namespace: be more careful when handling namespacing failures gracefully
This makes two changes to the namespacing code:

1. We'll only gracefully skip service namespacing on access failure if
   exclusively sandboxing options were selected, and not mount-related
   options that result in a very different view of the world. For example,
   ignoring RootDirectory=, RootImage= or Bind= is really problematic,
   but ReadOnlyPaths= is just a weaker sandbox.

2. The namespacing code will now return a clearly recognizable error
   code when it cannot enforce its namespacing, so that we cannot
   confuse EPERM errors from mount() with those from unshare(). Only the
   errors from the first unshare() are now taken as hint to gracefully
   disable namespacing.

Fixes: #9844 #9835
2018-08-21 20:00:33 +02:00
aszlig 66c91c3a23 umount: Don't use options from fstab on remount
The fstab entry may contain comment/application-specific options, like
for example x-systemd.automount or x-initrd.mount.

With the recent switch to libmount, the mount options during remount are
now gathered via mnt_fs_get_options(), which returns the merged fstab
options with the effective options in mountinfo.

Unfortunately, if one of these application-specific options is set in
fstab, the remount will fail with -EINVAL.

In systemd 238:

  Remounting '/test-x-initrd-mount' read-only in with options
  'errors=continue,user_xattr,acl'.

In systemd 239:

  Remounting '/test-x-initrd-mount' read-only in with options
  'errors=continue,user_xattr,acl,x-initrd.mount'.
  Failed to remount '/test-x-initrd-mount' read-only: Invalid argument

So instead of using mnt_fs_get_options(), we're now using both
mnt_fs_get_fs_options() and mnt_fs_get_vfs_options() and merging the
results together so we don't get any non-relevant options from fstab.

Signed-off-by: aszlig <aszlig@nix.build>
2018-08-21 19:49:51 +02:00
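The merge described in the umount commit above can be pictured with a short Python sketch. This is not the libmount API: `merge_mount_options` is a hypothetical helper, and its two arguments only stand in for the strings that mnt_fs_get_vfs_options() and mnt_fs_get_fs_options() would return. The VFS value `noatime` is an assumed example; the filesystem options are the ones quoted in the commit message.

```python
def merge_mount_options(vfs_options, fs_options):
    """Join two comma-separated option strings, dropping empty and duplicate entries.

    Hypothetical helper for illustration only; the real code works on the
    strings returned by libmount.
    """
    merged = []
    for chunk in (vfs_options, fs_options):
        for option in (chunk or "").split(","):
            if option and option not in merged:
                merged.append(option)
    return ",".join(merged)


# Assumed inputs: only options the kernel reports, nothing from fstab.
vfs = "noatime"
fs = "errors=continue,user_xattr,acl"
print(merge_mount_options(vfs, fs))
```

Because neither input comes from fstab, application-specific entries such as x-initrd.mount never reach the remount call in this model.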
Zbigniew Jędrzejewski-Szmek 0566668016
Merge pull request #9712 from filbranden/socket1
socket-util: Introduce send_one_fd_iov() and receive_one_fd_iov()
2018-08-21 19:45:44 +02:00
Zbigniew Jędrzejewski-Szmek 7692fed98b
Merge pull request #9783 from poettering/get-user-creds-flags
beef up get_user_creds() a bit and other improvements
2018-08-21 10:09:33 +02:00
Zbigniew Jędrzejewski-Szmek 00c4361878
Merge pull request #9853 from poettering/uneeded-queue
rework StopWhenUnneeded=1 logic
2018-08-21 10:06:30 +02:00
Lennart Poettering fafff8f1ff user-util: rework get_user_creds()
Let's fold get_user_creds_clean() into get_user_creds(), and introduce a
flags argument for it to select "clean" behaviour. This flags parameter
also learns two other new flags:

- USER_CREDS_SYNTHESIZE_FALLBACK: in this mode the user records for
  root/nobody are only synthesized as fallback. Normally, the synthesized
  records take precedence over what is in the user database.  With this
  flag set this is reversed, and the user database takes precedence, and
  the synthesized records are only used if they are missing there. This
  flag should be set in cases where doing NSS is deemed safe, and where
  there's interest in knowing the correct shell, for example if the
  admin changed root's shell to zsh or suchlike.

- USER_CREDS_ALLOW_MISSING: if set, and a UID/GID is specified by
  numeric value, and there's no user/group record for it, accept it
  anyway. This allows us to fix #9767

This then also ports all users to set the most appropriate flags.

Fixes: #9767

[zj: remove one isempty() call]
2018-08-20 15:58:21 +02:00
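The precedence rules described in the get_user_creds() commit above can be modelled roughly as follows. This is a Python sketch of the described behaviour, not the C function: the `UserRecord` tuple, the dictionary-based `database`, and the synthesized shell paths are all assumptions made for illustration.

```python
import enum
from collections import namedtuple

UserRecord = namedtuple("UserRecord", "uid gid shell")


class UserCredsFlags(enum.Flag):
    NONE = 0
    SYNTHESIZE_FALLBACK = enum.auto()  # user database wins over synthesized records
    ALLOW_MISSING = enum.auto()        # accept numeric IDs that have no record


# Assumed synthesized records; the shell paths are illustrative only.
SYNTHESIZED = {
    "root": UserRecord(0, 0, "/bin/sh"),
    "nobody": UserRecord(65534, 65534, "/sbin/nologin"),
}


def get_user_creds(name, database, flags=UserCredsFlags.NONE):
    """Resolve `name` (a user name or numeric UID string) to a UserRecord."""
    numeric = name.isascii() and name.isdigit()
    synthesized = SYNTHESIZED.get(name)

    # Default: synthesized root/nobody records take precedence.
    if synthesized is not None and not flags & UserCredsFlags.SYNTHESIZE_FALLBACK:
        return synthesized

    record = database.get(name)
    if record is None and numeric:
        record = next((r for r in database.values() if r.uid == int(name)), None)
    if record is not None:
        return record

    # SYNTHESIZE_FALLBACK: only used when the database has nothing.
    if synthesized is not None:
        return synthesized

    # ALLOW_MISSING: a bare numeric ID is accepted without a record.
    if flags & UserCredsFlags.ALLOW_MISSING and numeric:
        return UserRecord(int(name), None, None)

    raise KeyError(name)
```

With a database in which the admin changed root's shell, the default call still returns the synthesized record, while passing `UserCredsFlags.SYNTHESIZE_FALLBACK` returns the database entry.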
Lennart Poettering b2a60844c4 namespace: when creating device nodes, also create /dev/char/* symlinks
On the host these symlinks are created by udev, and we consider them API
and make use of them ourselves at various places. Hence when running a
private /dev, also create these symlinks so that lookups by major/minor
work in such an environment, too.
2018-08-20 15:58:11 +02:00
Zbigniew Jędrzejewski-Szmek a9e241dcb3
Merge pull request #9801 from yuwata/analyze-cleanups
analyze: several improvements
2018-08-20 13:12:53 +02:00
Lennart Poettering 3cd24c1aa9 core: when setting up PAM, try to get tty of STDIN_FILENO if not set explicitly
When stdin/stdout/stderr is initialized from an fd, let's read the tty
name of it if we can, and pass that to PAM.

This makes sure that "machinectl shell" sessions have proper TTY fields
initialized that "loginctl" then shows.
2018-08-20 12:28:17 +02:00
Lennart Poettering 37ec0fdd34 tree-wide: add clickable man page link to all --help texts
This is a bit like the info link in most of GNU's --help texts, but we
don't do info but man pages, and we make them properly clickable on
terminals supporting that, because awesome.

I think it's generally advisable to link up our (brief) --help texts and
our (more comprehensive) man pages a bit, so this should be an easy and
straight-forward way to do it.
2018-08-20 11:33:04 +02:00
Zbigniew Jędrzejewski-Szmek fda09318e3 core: rename function to better reflect semantics 2018-08-20 10:43:31 +02:00
Lennart Poettering a3c1168ac2 core: rework StopWhenUnneeded= logic
Previously, we'd act immediately on StopWhenUnneeded= when a unit state
changes. With this rework we'll maintain a queue instead: whenever
there's the chance that StopWhenUnneeded= might have an effect we enqueue
the unit, and process it later when we have nothing better to do.

This should make the implementation a bit more reliable, as the unit notify event
cannot immediately enqueue tons of side-effect jobs that might
contradict each other, but we do so only in a strictly ordered fashion,
from the main event loop.

This slightly changes the check when to consider a unit "unneeded".
Previously, we'd assume that a unit in "deactivating" state could also
be cleaned up. With this new logic we'll only consider units unneeded
that are fully up and have no job queued. This means that whenever
there's something pending for a unit we won't clean it up.
2018-08-10 16:19:01 +02:00
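The queue-based approach in the StopWhenUnneeded= commit above can be sketched in Python. All class, attribute, and state names here are hypothetical and chosen for the example; in particular, which dependent states keep a unit "needed" is an assumption, not taken from the commit.

```python
from collections import deque


class Unit:
    def __init__(self, name, stop_when_unneeded=False):
        self.name = name
        self.stop_when_unneeded = stop_when_unneeded
        self.state = "active"
        self.job = None          # name of a queued job, if any
        self.needed_by = set()   # units that require this one
        self.in_queue = False


class Manager:
    def __init__(self):
        self.queue = deque()
        self.stopped = []

    def notify(self, unit):
        """A state change only enqueues the unit; nothing is stopped here."""
        if unit.stop_when_unneeded and not unit.in_queue:
            unit.in_queue = True
            self.queue.append(unit)

    @staticmethod
    def is_unneeded(unit):
        # Only fully started units without a pending job qualify.
        if unit.state != "active" or unit.job is not None:
            return False
        # Assumption: active or activating dependents keep the unit needed.
        return not any(u.state in ("active", "activating") for u in unit.needed_by)

    def dispatch(self):
        """Called from the main loop when there is nothing better to do."""
        while self.queue:
            unit = self.queue.popleft()
            unit.in_queue = False
            if self.is_unneeded(unit):
                unit.job = "stop"
                self.stopped.append(unit.name)
```

Separating `notify()` from `dispatch()` mirrors the point of the rework: side-effect jobs are created in one ordered pass rather than in the middle of a state-change notification.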
Zbigniew Jędrzejewski-Szmek b257b19e6b
Merge pull request #9848 from yuwata/fix-9835-9844
core: namespace fixes
2018-08-10 15:36:34 +02:00
Yu Watanabe 4c3a2b84d8 core/execute: fix dump format for Limit*=
Fixes #9846.
2018-08-10 11:59:16 +02:00
Yu Watanabe 763a260ae7 core/namespace: add more log messages
Suggested by #9835.
2018-08-10 14:30:35 +09:00
Yu Watanabe 7e8d494b33 core: use memcpy_safe()
Fixes #9738.
2018-08-08 17:11:43 +09:00
Lennart Poettering 91f4424012
Merge pull request #9817 from yuwata/shorten-error-logging
tree-wide: Shorten error logging and several code cleanups
2018-08-07 10:44:44 +02:00
Lennart Poettering 6f663594bc
Merge pull request #9744 from yuwata/fix-9737
Make RootImage= work with PrivateDevices=
2018-08-07 09:55:07 +02:00
Yu Watanabe fc95c359f6 tree-wide: use returned value from log_*_errno() 2018-08-07 15:48:37 +09:00
Filipe Brandenburger a0edd02e43 tree-wide: Convert compare_func's to use CMP() macro wherever possible.
Looked for definitions of functions using the *_compare_func() suffix.

Tested:
- Unit tests passed (ninja -C build/ test)
- Installed this build and booted with it.
2018-08-06 19:26:35 -07:00
Yu Watanabe 4ae25393f3 tree-wide: shorten error logging a bit
Continuation of 4027f96aa0.
2018-08-07 10:14:33 +09:00
Yu Watanabe 7bc740f480 core: add comments about timestamps stored in manager 2018-08-06 22:21:05 +09:00
Yu Watanabe fe65e88ba6 namespace: implicitly adds DeviceAllow= when RootImage= is set
RootImage= may require the following settings
```
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rwm
DeviceAllow=block-blkext rwm
```
This adds the following settings implicitly when RootImage= is
specified.

Fixes #9737.
2018-08-06 14:02:31 +09:00
Yu Watanabe fd870bac25 core: introduce cgroup_add_device_allow() 2018-08-06 13:42:14 +09:00
Yu Watanabe 839f187753 core/namespace: drop mount points outside of root even if RootDirectory= is not set 2018-08-06 12:51:33 +09:00
Yu Watanabe 9b68367b3a core/namespace: drop conditions that depend on whether `root` is empty
After 0722b35934, the variable `root`
is always set.
2018-08-06 12:51:33 +09:00
Filipe Brandenburger d34673ecb8 socket-util: Introduce send_one_fd_iov() and receive_one_fd_iov()
These take a struct iovec to send data together with the passed FD.

The receive function returns the FD through an output argument. In case data is
received, but no FD is passed, the receive function will set the output
argument to -1 explicitly.

Update code in dynamic-user to use the new helpers.
2018-08-02 09:25:04 -07:00
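The behaviour described in the socket-util commit above (data plus an optional FD, with the receiver reporting -1 when no FD arrived) can be illustrated in Python with SCM_RIGHTS ancillary messages. The function names are reused for readability only; the signatures below are assumptions and do not match the C helpers.

```python
import array
import os
import socket


def send_one_fd_iov(sock, data, fd=-1):
    """Send `data`; attach `fd` as SCM_RIGHTS ancillary data when fd >= 0."""
    ancillary = []
    if fd >= 0:
        payload = array.array("i", [fd]).tobytes()
        ancillary.append((socket.SOL_SOCKET, socket.SCM_RIGHTS, payload))
    return sock.sendmsg([data], ancillary)


def receive_one_fd_iov(sock, bufsize=4096):
    """Receive data and at most one FD; the FD is -1 if none was passed."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(bufsize, socket.CMSG_LEN(fds.itemsize))
    fd = -1
    for level, kind, cdata in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
            if len(fds) > 0:
                fd = fds[0]
            break
    return data, fd
```

A quick way to try it is a `socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)` with one end of an `os.pipe()` as the passed descriptor.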
Zbigniew Jędrzejewski-Szmek 5b316330be
Merge pull request #9624 from poettering/service-state-flush
flush out ExecStatus structures when a new service cycle begins
2018-08-02 09:50:39 +02:00
Zbigniew Jędrzejewski-Szmek 7426028b7a
Merge pull request #9720 from yuwata/fix-9702
Fix DynamicUser=yes with static User= whose UID and GID are different
2018-07-26 11:42:00 +02:00
Zbigniew Jędrzejewski-Szmek 54fe2ce1b9
Merge pull request #9504 from poettering/nss-deadlock
some nss deadlock love
2018-07-26 10:16:25 +02:00
Zbigniew Jędrzejewski-Szmek cf6e28f3cb
Merge pull request #9484 from poettering/permille-everywhere
Permille everywhere
2018-07-26 10:13:56 +02:00
Yu Watanabe 25a1df7c65 core: fix gid when DynamicUser=yes with static User=
When DynamicUser=yes and static User= are set, and the user has
different uid and gid, then as the storage socket for the dynamic
user does not contain the gid, we need to obtain the gid.

Follow-up for 9ec655cbbd.

Fixes #9702.
2018-07-26 15:38:18 +09:00
Lennart Poettering 5686391b00 core: introduce new Type=exec service type
Users are often surprised that "systemd-run" command lines like
"systemd-run -p User=idontexist /bin/true" will return successfully,
even though the logs show that the process couldn't be invoked, as the
user "idontexist" doesn't exist. This is because Type=simple will only
wait until fork() succeeded before returning start-up success.

This patch adds a new service type Type=exec, which is very similar to
Type=simple, but waits until the child process completed the execve()
before returning success. It uses a pipe that has O_CLOEXEC set for this
logic, so that the kernel automatically sends POLLHUP on it when the
execve() succeeded but leaves the pipe open if not. This means PID 1
waits exactly until the execve() succeeded in the child, and not longer
and not shorter, which is the desired functionality.

Making use of this new functionality, the command line
"systemd-run -p User=idontexist -p Type=exec /bin/true" will now fail,
as expected.
2018-07-25 22:48:11 +02:00
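The O_CLOEXEC pipe trick described in the Type=exec commit above can be demonstrated in a few lines of Python on Linux. This is a sketch of the general technique, not systemd's code: `spawn_and_wait_for_exec` is a hypothetical name, and having the child write its errno into the pipe on failure is a common variant added here so the parent can tell the two cases apart.

```python
import os
import struct


def spawn_and_wait_for_exec(argv):
    """Fork, exec argv, and return the child PID once execve() has succeeded.

    Raises OSError with the child's errno if the exec fails.
    """
    # Both pipe ends are close-on-exec: a successful execve() closes the
    # child's write end, so the parent sees EOF exactly at that moment.
    read_end, write_end = os.pipe2(os.O_CLOEXEC)
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            os.execv(argv[0], argv)
        except OSError as error:
            # exec failed: report errno through the still-open pipe.
            os.write(write_end, struct.pack("i", error.errno))
        finally:
            os._exit(127)

    os.close(write_end)
    data = os.read(read_end, struct.calcsize("i"))
    os.close(read_end)
    if data:
        os.waitpid(pid, 0)  # reap the failed child
        code = struct.unpack("i", data)[0]
        raise OSError(code, os.strerror(code))
    return pid
```

Calling it with a path that does not exist raises in the parent, which corresponds to the "systemd-run -p Type=exec" failure case from the commit message; with a valid binary it returns as soon as the exec has happened, without waiting for the program to finish.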
Lennart Poettering ce0d60a7c4 execute: use our usual syntax for defining bit masks 2018-07-25 22:48:11 +02:00
Lennart Poettering 25b583d7ff core: swap order of "n_storage_fds" and "n_socket_fds" parameters
When processing fd lists to pass to activated programs we always place the
socket activation fds first, and the storage fds last. Irritatingly in
almost all calls the "n_storage_fds" parameter (i.e. the number of
storage fds to pass) came first so far, and the "n_socket_fds" parameter
second. Let's clean this up, and specify the number of fds in the order
the fds themselves are passed.

(Also, let's fix one more case where "unsigned" was used to size an
array, while we should use "size_t" instead.)
2018-07-25 22:48:11 +02:00
Lennart Poettering f806dfd345 tree-wide: increase granularity of percent specifications all over the place to permille
We so far had various places where we'd parse percentages with
parse_percent(). Let's make them use parse_permille() instead, which is
downward compatible (as it also parses percent values), and increases
the granularity a bit. Given that on the wire we usually normalize
relative specifications to something like UINT32_MAX anyway changing
from base-100 to base-1000 calculations can be done easily without
breaking compat.

This commit doesn't document this change in the man pages. While
allowing more precise specifications, permille is not as commonly
understood as percent I guess, hence let's keep this out of the docs for
now.
2018-07-25 16:14:45 +02:00
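The compatibility argument in the permille commit above can be made concrete with a simplified Python model. The accepted syntax here (an integer followed by "‰", or a percentage with at most one decimal digit) and the helper `permille_to_wire` are assumptions for illustration; they are not a transcription of the C parser.

```python
UINT32_MAX = 2**32 - 1


def _is_ascii_digits(text):
    return text.isascii() and text.isdigit()


def parse_permille(text):
    """Parse '125‰', '12.5%' or '12%' into an integer in the range 0..1000."""
    text = text.strip()
    if text.endswith("‰"):
        number = text[:-1]
        if not _is_ascii_digits(number):
            raise ValueError(f"invalid permille value: {text!r}")
        value = int(number)
    elif text.endswith("%"):
        whole, dot, fraction = text[:-1].partition(".")
        if not _is_ascii_digits(whole):
            raise ValueError(f"invalid percent value: {text!r}")
        if dot and not (len(fraction) == 1 and _is_ascii_digits(fraction)):
            raise ValueError(f"at most one decimal digit is supported: {text!r}")
        value = int(whole) * 10 + (int(fraction) if dot else 0)
    else:
        raise ValueError(f"missing '%' or '‰' suffix: {text!r}")
    if value > 1000:
        raise ValueError(f"value out of range: {text!r}")
    return value


def permille_to_wire(permille):
    """Scale 0..1000 onto 0..UINT32_MAX, as a stand-in for the normalized form."""
    return permille * UINT32_MAX // 1000
```

Plain percent values keep parsing as before (for example `parse_permille("50%")` gives 500), which is the sense in which the change is downward compatible, and since the normalized value is already scaled to UINT32_MAX, moving from base 100 to base 1000 only changes the divisor.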