Commit graph

138 commits

Author SHA1 Message Date
Anita Zhang 7ca69792e5 core: add ':' prefix to ExecXYZ= skip env var substitution 2019-02-20 17:58:14 +01:00
Topi Miettinen aecd5ac621 core: ProtectHostname= feature
Let services use a private UTS namespace. In addition, a seccomp filter is
installed on set{host,domain}name and a ro bind mounts on
/proc/sys/kernel/{host,domain}name.
2019-02-20 10:50:44 +02:00
Alexey Bogdanenko 8f9f3cb724 core: fix KeyringMode for user services
KeyringMode option is useful for user services. Also, documentation for the
option suggests that the option applies to user services. However, setting the
option to any of its allowed values has no effect.

This commit fixes that and removes EXEC_NEW_KEYRING flag. The flag is no longer
necessary: instead of checking if the flag is set we can check if keyring_mode
is not equal to EXEC_KEYRING_INHERIT.
2018-12-17 16:56:36 +01:00
Yu Watanabe e93672eeac tree-wide: drop missing.h from headers and use relevant missing_*.h 2018-12-06 13:31:16 +01:00
Lennart Poettering 78f93209fc core: when Delegate=yes is set for a unit, run ExecStartPre= and friends in a subcgroup of the unit
Otherwise we might conflict with the "no-processes-in-inner-cgroup" rule
of cgroupsv2. Consider nspawn starting up and initializing its cgroup
hierarchy with "supervisor/" and "payload/" as subcgroup, with itself
moved into the former and the payload into the latter. Now, if an
ExecStartPre= is run right after it cannot be placed in the main cgroup,
because that is now in inner cgroup with populated children.

Hence, let's run these helpers in another sub-cgroup .control/ below it.

This is somewhat ugly since it weakens the clear separation of
ownership, but given that this is an explicit contract, and double opt-in should be acceptable.

Fixes: #10482
2018-11-26 18:43:23 +01:00
Yu Watanabe b9c04eafb8 core: introduce exec_params_clear()
Follow-up for 1ad6e8b302.

Fixes #10677.
2018-11-08 09:36:37 +01:00
Lennart Poettering 7593c3ecbf tree-wide: add a few missing _sentinel_ decorators 2018-10-26 19:49:15 +02:00
Anita Zhang 90fc172e19 core: implement per unit journal rate limiting
Add LogRateLimitIntervalSec= and LogRateLimitBurst= options for
services. If provided, these values get passed to the journald
client context, and those values are used in the rate limiting
function in the journal over the the journald.conf values.

Part of #10230
2018-10-18 09:56:20 +02:00
Yu Watanabe cf4b2f9906 tree-wide: use unsigned for refcount 2018-08-27 13:48:04 +09:00
Zbigniew Jędrzejewski-Szmek 5b316330be
Merge pull request #9624 from poettering/service-state-flush
flush out ExecStatus structures when a new service cycle begins
2018-08-02 09:50:39 +02:00
Lennart Poettering 5686391b00 core: introduce new Type=exec service type
Users are often surprised that "systemd-run" command lines like
"systemd-run -p User=idontexist /bin/true" will return successfully,
even though the logs show that the process couldn't be invoked, as the
user "idontexist" doesn't exist. This is because Type=simple will only
wait until fork() succeeded before returning start-up success.

This patch adds a new service type Type=exec, which is very similar to
Type=simple, but waits until the child process completed the execve()
before returning success. It uses a pipe that has O_CLOEXEC set for this
logic, so that the kernel automatically sends POLLHUP on it when the
execve() succeeded but leaves the pipe open if not. This means PID 1
waits exactly until the execve() succeeded in the child, and not longer
and not shorter, which is the desired functionality.

Making use of this new functionality, the command line
"systemd-run -p User=idontexist -p Type=exec /bin/true" will now fail,
as expected.
2018-07-25 22:48:11 +02:00
Lennart Poettering ce0d60a7c4 execute: use our usual syntax for defining bit masks 2018-07-25 22:48:11 +02:00
Lennart Poettering 25b583d7ff core: swap order of "n_storage_fds" and "n_socket_fds" parameters
When process fd lists to pass to activated programs we always place the
socket activation fds first, and the storage fds last. Irritatingly in
almost all calls the "n_storage_fds" parameter (i.e. the number of
storage fds to pass) came first so far, and the "n_socket_fds" parameter
second. Let's clean this up, and specify the number of fds in the order
the fds themselves are passed.

(Also, let's fix one more case where "unsigned" was used to size an
array, while we should use "size_t" instead.)
2018-07-25 22:48:11 +02:00
Lennart Poettering 6a1d4d9fa6 core: properly reset all ExecStatus structures when entering a new unit cycle
Whenever a unit is started fresh we should flush out any runtime data
from the previous cycle. We are pretty good at that already, but what so
far we missed was the ExecStart=/ExecStop=/… command exit status data.
Let's fix that, and properly flush out that stuff too.

Consider this service:

    [Service]
    ExecStart=/bin/sleep infinity
    ExecStop=/bin/false

When this service is started, then stopped and then started again
"systemctl status" would show the ExecStop= results of the previous run
along with the ExecStart= results of the current one, which is very
confusing. With this patch this is corrected: the data is kept right
until the moment the new service cycle starts, and then flushed out.
Hence "systemctl status" in that case will only show the ExecStart=
data, but no ExecStop= data, like it should be.

This should fix part of the confusion of #9588
2018-07-23 13:36:47 +02:00
Lennart Poettering 42cb05d5ff execute: document what the different structures are for in comments 2018-07-23 13:36:47 +02:00
Lennart Poettering ee39ca20c6 core: drop "argv" field from ExecParameter structure
We always initialize it from the same field in ExecCommand anyway, hence
there's no point in passing it separately to exec_spawn(), after all we
already pass the ExecCommand structure itself anyway.

No change in behaviour.
2018-07-23 13:36:47 +02:00
Zsolt Dollenstein 566b7d23eb Add support for opening files for appending
Addresses part of #8983
2018-07-20 03:54:22 -07:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Zbigniew Jędrzejewski-Szmek 70127be805
Merge pull request #9153 from poettering/private-mounts
introduce PrivateMounts= setting and clean up documentation for MountFlags=
2018-06-13 08:20:18 +02:00
Lennart Poettering ef31828d06 tree-wide: unify how we define bit mak enums
Let's always write "1 << 0", "1 << 1" and so on, except where we need
more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force
64bit values.
2018-06-12 21:44:00 +02:00
Lennart Poettering 228af36fff core: add new PrivateMounts= unit setting
This new setting is supposed to be useful in most cases where
"MountFlags=slave" is currently used, i.e. as an explicit way to run a
service in its own mount namespace and decouple propagation from all
mounts of the new mount namespace towards the host.

The effect of MountFlags=slave and PrivateMounts=yes is mostly the same,
as both cause a CLONE_NEWNS namespace to be opened, and both will result
in all mounts within it to be mounted MS_SLAVE. The difference is mostly
on the conceptual/philosophical level: configuring the propagation mode
is nothing people should have to think about, in particular as the
matter is not precisely easyto grok. Moreover, MountFlags= allows configuration
of "private" and "slave" modes which don't really make much sense to use
in real-life and are quite confusing. In particular PrivateMounts=private means
mounts made on the host stay pinned for good by the service which is
particularly nasty for removable media mount. And PrivateMounts=shared
is in most ways a NOP when used a alone...

The main technical difference between setting only MountFlags=slave or
only PrivateMounts=yes in a unit file is that the former remounts all
mounts to MS_SLAVE and leaves them there, while that latter remounts
them to MS_SHARED again right after. The latter is generally a nicer
approach, since it disables propagation, while MS_SHARED is afterwards
in effect, which is really nice as that means further namespacing down
the tree will get MS_SHARED logic by default and we unify how
applications see our mounts as we always pass them as MS_SHARED
regardless whether any mount namespacing is used or not.

The effect of PrivateMounts=yes was implied already by all the other
mount namespacing options. With this new option we add an explicit knob
for it, to request it without any other option used as well.

See: #4393
2018-06-12 16:12:10 +02:00
Lennart Poettering da6053d0a7 tree-wide: be more careful with the type of array sizes
Previously we were a bit sloppy with the index and size types of arrays,
we'd regularly use unsigned. While I don't think this ever resulted in
real issues I think we should be more careful there and follow a
stricter regime: unless there's a strong reason not to use size_t for
array sizes and indexes, size_t it should be. Any allocations we do
ultimately will use size_t anyway, and converting forth and back between
unsigned and size_t will always be a source of problems.

Note that on 32bit machines "unsigned" and "size_t" are equivalent, and
on 64bit machines our arrays shouldn't grow that large anyway, and if
they do we have a problem, however that kind of overly large allocation
we have protections for usually, but for overflows we do not have that
so much, hence let's add it.

So yeah, it's a story of the current code being already "good enough",
but I think some extra type hygiene is better.

This patch tries to be comprehensive, but it probably isn't and I missed
a few cases. But I guess we can cover that later as we notice it. Among
smaller fixes, this changes:

1. strv_length()' return type becomes size_t

2. the unit file changes array size becomes size_t

3. DNS answer and query array sizes become size_t

Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=76745
2018-04-27 14:29:06 +02:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Yu Watanabe 2abd4e388a core: add new setting TemporaryFileSystem=
This introduces a new setting TemporaryFileSystem=. This is useful
to hide files not relevant to the processes invoked by unit, while
necessary files or directories can be still accessed by combining
with Bind{,ReadOnly}Paths=.
2018-02-21 09:17:52 +09:00
Yu Watanabe 34cf6c4340 core/execute: make arguments constant if possible
Also make functions static if possible.
2018-02-06 16:00:50 +09:00
Yu Watanabe e8a565cb66 core: make ExecRuntime be manager managed object
Before this, each ExecRuntime object is owned by a unit. However,
it may be shared with other units which enable JoinsNamespaceOf=.
Thus, by the serialization/deserialization process, its sharing
information, more specifically, reference counter is lost, and
causes issue #7790.

This makes ExecRuntime objects be managed by manager, and changes
the serialization/deserialization process.

Fixes #7790.
2018-02-06 16:00:34 +09:00
Lennart Poettering 0133d5553a
Merge pull request #7198 from poettering/stdin-stdout
Add StandardInput=data, StandardInput=file:... and more
2017-11-19 19:49:11 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 2038c3f584 core: add support for StandardInputFile= and friends
These new settings permit specifiying arbitrary paths as
stdin/stdout/stderr locations. We try to open/create them as necessary.
Some special magic is applied:

1) if the same path is specified for both input and output/stderr, we'll
   open it only once O_RDWR, and duplicate them fd instead.

2) If we an AF_UNIX socket path is specified, we'll connect() to it,
   rather than open() it. This allows invoking systemd services with
   stdin/stdout/stderr connected to arbitrary foreign service sockets.

Fixes: #3991
2017-11-17 11:13:44 +01:00
Lennart Poettering 08f3be7a38 core: add two new unit file settings: StandardInputData= + StandardInputText=
Both permit configuring data to pass through STDIN to an invoked
process. StandardInputText= accepts a line of text (possibly with
embedded C-style escapes as well as unit specifiers), which is appended
to the buffer to pass as stdin, followed by a single newline.
StandardInputData= is similar, but accepts arbitrary base64 encoded
data, and will not resolve specifiers or C-style escapes, nor append
newlines.

This may be used to pass input/configuration data to services, directly
in-line from unit files, either in a cooked or in a more raw format.
2017-11-17 11:13:44 +01:00
Lennart Poettering d3070fbdf6 core: implement /run/systemd/units/-based path for passing unit info from PID 1 to journald
And let's make use of it to implement two new unit settings with it:

1. LogLevelMax= is a new per-unit setting that may be used to configure
   log priority filtering: set it to LogLevelMax=notice and only
   messages of level "notice" and lower (i.e. more important) will be
   processed, all others are dropped.

2. LogExtraFields= is a new per-unit setting for configuring per-unit
   journal fields, that are implicitly included in every log record
   generated by the unit's processes. It takes field/value pairs in the
   form of FOO=BAR.

Also, related to this, one exisiting unit setting is ported to this new
facility:

3. The invocation ID is now pulled from /run/systemd/units/ instead of
   cgroupfs xattrs. This substantially relaxes requirements of systemd
   on the kernel version and the privileges it runs with (specifically,
   cgroupfs xattrs are not available in containers, since they are
   stored in kernel memory, and hence are unsafe to permit to lesser
   privileged code).

/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.

Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.

This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.

Fixes: #4089
Fixes: #3041
Fixes: #4441
2017-11-16 12:40:17 +01:00
Yu Watanabe 8cfa775f4f core: add support to specify errno in SystemCallFilter=
This makes each system call in SystemCallFilter= blacklist optionally
takes errno name or number after a colon. The errno takes precedence
over the one given by SystemCallErrorNumber=.

C.f. #7173.
Closes #7169.
2017-11-11 21:54:12 +09:00
Lennart Poettering 72fd17682d core: usually our enum's _INVALID and _MAX special values are named after the full type
In most cases we followed the rule that the special _INVALID and _MAX
values we use in our enums use the full type name as prefix (in contrast
to regular values that we often make shorter), do so for
ExecDirectoryType as well.

No functional changes, just a little bit of renaming to make this code
more like the rest.
2017-10-02 17:41:43 +02:00
Lennart Poettering b1edf4456e core: add new per-unit setting KeyringMode= for controlling kernel keyring setup
Usually, it's a good thing that we isolate the kernel session keyring
for the various services and disconnect them from the user keyring.
However, in case of the cryptsetup key caching we actually want that
multiple instances of the cryptsetup service can share the keys in the
root user's user keyring, hence we need to be able to disable this logic
for them.

This adds KeyringMode=inherit|private|shared:

    inherit: don't do any keyring magic (this is the default in systemd --user)
    private: a private keyring as before (default in systemd --system)
    shared: the new setting
2017-09-15 16:53:35 +02:00
Lennart Poettering 00819cc151 core: add new UnsetEnvironment= setting for unit files
With this setting we can explicitly unset specific variables for
processes of a unit, as last step of assembling the environment block
for them. This is useful to fix #6407.

While we are at it, greatly expand the documentation on how the
environment block for forked off processes is assembled.
2017-09-14 15:17:40 +02:00
Topi Miettinen 78e864e5b3 seccomp: LockPersonality boolean (#6193)
Add LockPersonality boolean to allow locking down personality(2)
system call so that the execution domain can't be changed.
This may be useful to improve security because odd emulations
may be poorly tested and source of vulnerabilities, while
system services shouldn't need any weird personalities.
2017-08-29 15:54:50 +02:00
Lennart Poettering 165a31c0db core: add two new special ExecStart= character prefixes
This patch adds two new special character prefixes to ExecStart= and
friends, in addition to the existing "-", "@" and "+":

"!"  → much like "+", except with a much reduced effect as it only
       disables the actual setresuid()/setresgid()/setgroups() calls, but
       leaves all other security features on, including namespace
       options. This is very useful in combination with
       RuntimeDirectory= or DynamicUser= and similar option, as a user
       is still allocated and used for the runtime directory, but the
       actual UID/GID dropping is left to the daemon process itself.
       This should make RuntimeDirectory= a lot more useful for daemons
       which insist on doing their own privilege dropping.

"!!" → Similar to "!", but on systems supporting ambient caps this
       becomes a NOP. This makes it relatively straightforward to write
       unit files that make use of ambient capabilities to let systemd
       drop all privs while retaining compatibility with systems that
       lack ambient caps, where priv dropping is the left to the daemon
       codes themselves.

This is an alternative approach to #6564 and related PRs.
2017-08-10 15:04:32 +02:00
Lennart Poettering 1703fa41a7 core: rename EXEC_APPLY_PERMISSIONS → EXEC_APPLY_SANDBOXING
"Permissions" was a bit of a misnomer, as it suggests that UNIX file
permission bits are adjusted, which aren't really changed here. Instead,
this is about UNIX credentials such as users or groups, as well as
namespacing, hence let's use a more generic term here, without any
misleading reference to UNIX file permissions: "sandboxing", which shall
refer to all kinds of sandboxing technologies, including UID/GID
dropping, selinux relabelling, namespacing, seccomp, and so on.
2017-08-10 15:02:50 +02:00
Lennart Poettering 584b8688d1 execute: also fold the cgroup delegate bit into ExecFlags 2017-08-10 15:02:50 +02:00
Lennart Poettering ac6479781e execute: also control the SYSTEMD_NSS_BYPASS_BUS through an ExecFlags field
Also, correct the logic while we are at it: the variable is only
required for system services, not user services.
2017-08-10 15:02:49 +02:00
Lennart Poettering 8679efde21 execute: add one more ExecFlags flag, for controlling unconditional directory chowning
Let's decouple the Manager object from the execution logic a bit more
here too, and simply pass along the fact whether we should
unconditionally chown the runtime/... directories via the ExecFlags
field too.
2017-08-10 14:44:58 +02:00
Lennart Poettering af635cf377 execute: let's decouple execute.c a bit from the unit logic
Let's try to decouple the execution engine a bit from the Unit/Manager
concept, and hence pass one more flag as part of the ExecParameters flags
field.
2017-08-10 14:44:58 +02:00
Lennart Poettering 3ed0cd26ea execute: replace command flag bools by a flags field
This way, we can extend it later on in an easier way, and can pass it
along nicely.
2017-08-10 14:44:58 +02:00
Yu Watanabe 3536f49e8f core: add {State,Cache,Log,Configuration}Directory= (#6384)
This introduces {State,Cache,Log,Configuration}Directory= those are
similar to RuntimeDirectory=. They create the directories under
/var/lib, /var/cache/, /var/log, or /etc, respectively, with the mode
specified in {State,Cache,Log,Configuration}DirectoryMode=.

This also fixes #6391.
2017-07-18 14:34:52 +02:00
Yu Watanabe 53f47dfc7b core: allow preserving contents of RuntimeDirectory= over process restart
This introduces RuntimeDirectoryPreserve= option which takes a boolean
argument or 'restart'.

Closes #6087.
2017-07-17 16:22:25 +09:00
Lennart Poettering 7f452159b8 core: make IOSchedulingClass= and IOSchedulingPriority= settable for transient units
This patch is a bit more complex thant I hoped. In particular the single
IOScheduling= property exposed on the bus is split up into
IOSchedulingClass= and IOSchedulingPriority= (though compat is
retained). Otherwise the asymmetry between setting props and getting
them is a bit too nasty.

Fixes #5613
2017-06-26 17:43:18 +02:00
Franck Bui 4c47affcf1 core: remove the redundancy of 'n_fds' and 'n_storage_fds' in ExecParameters struct
'n_fds' field in the ExecParameters structure was counting the total number of
file descriptors to be passed to a unit.

This counter also includes the number of passed socket fds which is counted by
'n_socket_fds' already.

This patch removes that redundancy by replacing 'n_fds' with
'n_storage_fds'. The new field only counts the fds passed via the storage store
mechanism.  That way each fd is counted at one place only.

Subsequently the patch makes sure to fix code that used 'n_fds' and also wanted
to iterate through all of them by explicitly adding 'n_socket_fds' + 'n_storage_fds'.

Suggested by Lennart.
2017-06-08 16:21:35 +02:00
Franck Bui 9b1419111a core: only apply NonBlocking= to fds passed via socket activation
Make sure to only apply the O_NONBLOCK flag to the fds passed via socket
activation.

Previously the flag was also applied to the fds which came from the fd store
but this was incorrect since services, after being restarted, expect that these
passed fds have their flags unchanged and can be reused as before.

The documentation was a bit unclear about this so clarify it.
2017-06-06 22:42:50 +02:00
Lennart Poettering 915e6d1676 core: add RootImage= setting for using a specific image file as root directory for a service
This is similar to RootDirectory= but mounts the root file system from a
block device or loopback file instead of another directory.

This reuses the image dissector code now used by nspawn and
gpt-auto-discovery.
2017-02-07 12:19:42 +01:00