Commit graph

675 commits

Author SHA1 Message Date
Franck Bui f75f613d25 core: reduce the number of stalled PIDs from the watched processes list when possible
Some PIDs can remain in the watched list even though their processes have
exited since a long time. It can easily happen if the main process of a forking
service manages to spawn a child before the control process exits for example.

However when a pid is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows if the pid should belong to this unit exclusively: if
we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take this opportunity to remove any stalled PIDs from
the watched process list.

If we learnt about a PID in any other form (for example via PID file, via
searching, MAINPID= and so on), then we can't assume anything.
2019-03-20 10:51:49 +01:00
Lennart Poettering 9adb695987 core: split error list in comment for unit_start() in two 2019-03-18 16:06:36 +01:00
Lennart Poettering 36c4dc089e core: change emergency_action() to return void
The function so far always returned -ECANCELLED, which is ignored in all
cases the function is invoked, except one: in unit_test_start_limit()
where -ECANCELLED is returned when the start limit is hit, which is part
of unit_start()'s protocol of return values.

Since the emergency_action() logic should be relatively generic and is
used in many places, let's drop the return value from it, since it's
constant anyway, and in alll cases useless. Instead, let's return it in
unit_test_start_limit(), where it's part of the protocol.

No change in behaviour.
2019-03-18 16:06:36 +01:00
Lennart Poettering 2de9b9793b core: check start limit on condition checks too
Let's add a safety precaution: if the start condition checks for a unit
are tested too often and fail each time, let's rate limit this too.

This should add extra safety in case people define .path, .timer or
.automount units that trigger a service that as a conditoin that always
fails.
2019-03-18 16:06:36 +01:00
Lennart Poettering 5766aca8d2 core: modernize unit_start() a bit
No change in behaviour, just a re-line-breaking of the various comments
to our current coding style, and some use of SYNTHETIC_ERRNO().
2019-03-18 16:06:36 +01:00
Lennart Poettering a4191c9fb5 core: unify code for checking whether unit to trigger is loaded 2019-03-18 16:06:36 +01:00
Lennart Poettering 97a3f4ee05 core: rename unit_{start_limit|condition|assert}_test() to unit_test_xyz()
Just some renaming, no change in behaviour.

Background: I'd like to add more functions unit_test_xyz() that test
various things, hence let's streamline the naming a bit.
2019-03-18 16:06:36 +01:00
Lennart Poettering 9e30cf74ce core: add comment explaining ECOMM return value of unit_start()
we explain all other return values, explain these ones too.
2019-03-18 16:06:36 +01:00
Stephane Chazelas 106bf8e445 remove "." path components from required mount paths
unit_require_mounts_for may be passed path arguments that contain "."
components like for user's home directories where "." is sometimes used
to specify some form of anchor point.

This change stops considering such path as an error and removes the "."
components instead.

Closes: #11910
2019-03-07 10:12:03 +01:00
Lennart Poettering 5bcffb4b54
Merge pull request #11457 from grooverdan/sendsigkill_no
service: killmode=cgroup|mixed, SendSIGKILL=no services are not multiprocess
2019-02-18 13:41:52 +01:00
Daniel Black c53d2d54bd service: make killmode=cgroup|mixed, SendSIGKILL=no services singletons
KillMode=mixed and control group are used to indicate that all
process should be killed off. SendSIGKILL is used for services
that require a clean shutdown. These are typically database
service where a SigKilled process would result in a lengthy
recovery and who's shutdown or startup time is quite variable
(so Timeout settings aren't of use).

Here we take these two factors and refuse to start a service if
there are existing processes within a control group. Databases,
while generally having some protection against multiple instances
running, lets not stress the rigor of these. Also ExecStartPre
parts of the service aren't as rigoriously written to protect
against against multiple use.

closes #8630
2019-01-29 15:35:59 +11:00
Jonathon Kowalski 6255af75d7 Return -EAGAIN instead of -EALREADY from unit_reload
Fixes: #11499

Let's return -EAGAIN so that on state change, unit_process_job tries to
add our job to run_queue again so that all the reloads that coalesced
into the installed reload (which itself merged into a running one)
inititally atleast runs *once*. This should ensure service picks up all
config changes reliably.

See the issue being fixed for a detailed explanation.
2019-01-20 22:12:24 +00:00
Lennart Poettering 2d41e9b7a0
Merge pull request #11143 from keszybz/enable-symlink
Runtime mask symlink confusion fix
2018-12-16 12:37:07 +01:00
Zbigniew Jędrzejewski-Szmek 58d9d89b4b pid1: fix free of uninitialized pointer in unit_fail_if_noncanonical()
https://bugzilla.redhat.com/show_bug.cgi?id=1653068
2018-12-14 11:21:16 +01:00
Zbigniew Jędrzejewski-Szmek 303ee60151 Mark *data and *userdata params to specifier_printf() as const
It would be very wrong if any of the specfier printf calls modified
any of the objects or data being printed. Let's mark all arguments as const
(primarily to make it easier for the reader to see where modifications cannot
occur).
2018-12-12 16:45:33 +01:00
Lennart Poettering a1c7334b61 core: when a unit state changes only propagate to jobs after reloading is complete
Previously, we'd immediately propagate unit state changes into any jobs
pending for them, always. With this we only do this if the manager is
out of the "reload" state. This fixes the problem #8803 tried to
address, by simply not completing jobs until after the reload (and thus
reestablishment of the dbus connection) is complete.

Note that there's no need to later on explicitly catch up with the
missed job state changes (i.e. there's no need to call
unit_process_job() later one explicitly). That's because for jobs in
JOB_WAITING state on deserialization all jobs are requeued into the run
queue anyway, and thus checked again if they can complete now. And for
JOB_RUNNING jobs unit_catchup() phase is going to trigger missed out
state changes *after* the reload complete anyway (after all that's what
distinguishes from unit_coldplug()).

Replaces: #8803
2018-12-12 11:15:07 +01:00
Lennart Poettering 16c74914d2 core: split out all logic that updates a Job on a unit's unit_notify() invocation
Just some refactoring, no change in behaviour.
2018-12-12 11:15:07 +01:00
Lennart Poettering b17c9620c8 core: rework how we deserialize jobs
Let's add a helper call unit_deserialize_job() for this purpose, and
let's move registration in the global jobs hash table into
job_install_deserialized() so that it it is done after all superficial
checks are done, and before transitioning into installed states, so that
rollback code is not necessary anymore.
2018-12-12 11:15:07 +01:00
Zbigniew Jędrzejewski-Szmek 4cb06c5949 Use VLA instead of alloca
The test is the same, but an array is more readable.
2018-12-10 11:57:26 +01:00
Zbigniew Jędrzejewski-Szmek 2d479ff1cc
Merge pull request #10963 from poettering/bus-force-state-change-signal
force PropertiesChanged bus signal on all unit state changes
2018-12-06 16:42:21 +01:00
Lennart Poettering e4de72876e util-lib: split out all temporary file related calls into tmpfiles-util.c
This splits out a bunch of functions from fileio.c that have to do with
temporary files. Simply to make the header files a bit shorter, and to
group things more nicely.

No code changes, just some rearranging of source files.
2018-12-02 13:22:29 +01:00
Lennart Poettering ee228be10c util-lib: don't include fileio.h from fileio-label.h
There's no reason for doing that, hence simply don't.
2018-12-02 13:22:29 +01:00
Lennart Poettering 3c4832ada4 core: enqueue unit earlier when state changes
Previously, we'd enqueue a unit to the dbus queue whenever the state
changed, after we processed the state change fully. This commit to the
beginning of the state change. This has the benefit that when the state
change causes a job to complete the unit is already in the dbus queue,
and thus we get the guarantee that any unit change can be sent out to
clients before the job change.
2018-12-01 12:53:26 +01:00
Lennart Poettering af92c603bb core: send out unit change events when a new invocation ID is acquired
It's free, as this generally coincides with unit_start(), but let's make
this clean and explicit.
2018-12-01 12:53:26 +01:00
Lennart Poettering e18f8852f3 core: invalidate invidual Assert/Condition properties when sending out change messages
Let's inform the clients about assert/condition property changes as they
happen, it's basically for free because assert/condition property
changes generally coincide with other unit state changes (after all
these checks are done on unit_start())
2018-12-01 12:53:26 +01:00
Lennart Poettering 37d0b962ef core: when we manage to resolve a user, only enqueue dbus event, don't send out message right-away
Let's only enqueue the dbus signal generation, let's not do it
right-away, after all we want coalescing to take effect here.
2018-12-01 12:53:26 +01:00
Zbigniew Jędrzejewski-Szmek 8b4e51a60e
Merge pull request #10797 from poettering/run-generator
add new "systemd-run-generator" for running arbitrary commands from the kernel command line as system services using the "systemd.run=" kernel command line switch
2018-11-28 22:40:55 +01:00
Lennart Poettering 7af67e9a8b core: allow to set exit status when using SuccessAction=/FailureAction=exit in units
This adds SuccessActionExitStatus= and FailureActionExitStatus= that may
be used to configure the exit status to propagate in when
SuccessAction=exit or FailureAction=exit is used.

When not specified let's also propagate the exit status of the main
process we fork off for the unit.
2018-11-27 09:44:40 +01:00
Lennart Poettering 5b262f74e4 unit: tweak status output a bit
Let's highlight the unit description string in the status updates, to
separate them a bit more the english sentence they are part of, and thus
make the different casing less surprising.
2018-11-26 18:24:12 +01:00
Lennart Poettering b8b6f32104 cgroup: when we unload a unit, also update all its parent's members mask
This way we can corectly ensure that when a unit that requires some
controller goes away, we propagate the removal of it all the way up, so
that the controller is turned off in all the parents too.
2018-11-23 13:41:37 +01:00
Lennart Poettering 5af8805872 cgroup: drastically simplify caching of cgroups members mask
Previously we tried to be smart: when a new unit appeared and it only
added controllers to the cgroup mask we'd update the cached members mask
in all parents by ORing in the controller flags in their cached values.
Unfortunately this was quite broken, as we missed some conditions when
this cache had to be reset (for example, when a unit got unloaded),
moreover the optimization doesn't work when a controller is removed
anyway (as in that case there's no other way for the parent to iterate
though all children if any other, remaining child unit still needs it).
Hence, let's simplify the logic substantially: instead of updating the
cache on the right events (which we didn't get right), let's simply
invalidate the cache, and generate it lazily when we encounter it later.
This should actually result in better behaviour as we don't have to
calculate the new members mask for a whole subtree whever we have the
suspicion something changed, but can delay it to the point where we
actually need the members mask.

This allows us to simplify things quite a bit, which is good, since
validating this cache for correctness is hard enough.

Fixes: #9512
2018-11-23 13:41:37 +01:00
Lennart Poettering 0adf88b68c cgroup: dump delegation mask too 2018-11-23 12:24:37 +01:00
Lennart Poettering 00e7b3c8e5 unit: minor optimization, use stack over heap, when we can 2018-11-23 00:46:56 +01:00
Lennart Poettering 66fa4bdd70 core: add two minor comments (#10890) 2018-11-23 06:25:27 +09:00
Lennart Poettering 6e64994d69 core: make unit_start() return a distinguishable error code in case conditions didn't hold
Ideally we'd even propagate this all the way to the client, by having a
separate JobType enum value for this. But it's hard to add this without
breaking compat, hence for now let's at least internally propagate this
case differently from the case "already on it".

This is then used to call job_finish_and_invalidate() slightly
differently, with the already= parameter false, as in the failed
condition case no message was likely produced so far.
2018-11-16 15:22:48 +01:00
Lennart Poettering 523ee2d414 core: log a recognizable message when a unit succeeds, too
We already are doing it on failure, let's do it on success, too.

Fixes: #10265
2018-11-16 15:22:48 +01:00
Lennart Poettering 91bbd9b796 core: make log messages about unit processes exiting recognizable 2018-11-16 15:22:48 +01:00
Lennart Poettering 7c047d7443 core: make log messages about units entering a 'failed' state recognizable
Let's make this recognizable, and carry result information in a
structure fashion.
2018-11-16 15:22:48 +01:00
Lennart Poettering 33a3fdd978 core: move unit_status_emit_starting_stopping_reloading() and related calls to job.c
This call is only used by job.c and very specific to job handling.
Moreover the very similar logic of job_emit_status_message() is already
in job.c.

Hence, let's clean this up, and move both sets of functions to job.c,
and rename them a bit so that they express precisely what they do:

1. unit_status_emit_starting_stopping_reloading() →
   job_emit_begin_status_message()
2. job_emit_status_message() → job_emit_done_status_message()

The first call is after all what we call when we begin with the
execution of a job, and the second call what we call when we are done
wiht it.

Just some moving and renaming, not other changes, and hence no change in
behaviour.
2018-11-16 15:22:48 +01:00
Lennart Poettering 8204470252 unit: don't claim there was no IP traffic generated by a unit when we don't know
Only if we have some IP traffic accounting at all we should claim that.
2018-11-14 09:53:50 +01:00
Lennart Poettering 6eb65e7ca4 core: split out audit message generation from unit_notify()
Just some refactoring, no change in behaviour.
2018-11-14 09:51:47 +01:00
INSUN PYO 8724defeae core: use local variable m instead of u->manager 2018-11-13 10:39:35 +01:00
Tommi Rantala 429926e9cc core: include unit name in emergency_action() reason message
Add unit name in StartLimitAction=, FailureAction= and SuccessAction=
emergency_action() reason messages, so that the problematic unit is
easily visible, for example:

    "unit dbus.service failed"
2018-11-12 16:36:03 +01:00
Lennart Poettering 1ad6e8b302 core: split environment block mantained by PID 1's Manager object in two
This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from configuration file, kernel cmdline, environment generators. The
latter is the one the user can control with "systemctl set-environment"
and similar.

Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.

Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again things will revert to the
original transient variable now, while previously the variable was fully
removed. This change in behaviour should not matter too much though I
figure.

Fixes: #9972
2018-10-31 18:00:53 +01:00
Lennart Poettering d68c645bd3 core: rework serialization
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.

In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.

While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
2018-10-26 10:52:41 +02:00
Lennart Poettering 8948b3415d core: when deserializing state always use read_line(…, LONG_LINE_MAX, …)
This should be much better than fgets(), as we can read substantially
longer lines and overly long lines result in proper errors.

Fixes a vulnerability discovered by Jann Horn at Google.

CVE-2018-15686
LP: #1796402
https://bugzilla.redhat.com/show_bug.cgi?id=1639071
2018-10-26 10:40:01 +02:00
Martin Wilck e1e74614aa core: don't create Requires for workdir if "missing ok"
Don't add an implicit RequiresMountsFor depenency for the
WorkingDirectory of a unit if the "-" character was used to
indicate that "a missing working directory is not considered fatal"
(see systemd.exec(5)). Otherwise systemd might fail the unit
because of missing dependencies.
2018-10-25 11:35:59 +02:00
Yu Watanabe ec9d636b37 core: use ascii_toupper() instead of everytime judging whether it is the first message 2018-10-24 04:58:08 +09:00
Lennart Poettering a87b1faad3 core: beautify per-unit consumed resources log message a bit. (#10390)
Shorten message to say "no IP traffic" if there is no IP traffic, rather
than "received 0B IP traffic, sent 0B IP traffic".

Fixes: #9816
2018-10-19 09:04:12 +09:00
Anita Zhang 90fc172e19 core: implement per unit journal rate limiting
Add LogRateLimitIntervalSec= and LogRateLimitBurst= options for
services. If provided, these values get passed to the journald
client context, and those values are used in the rate limiting
function in the journal over the the journald.conf values.

Part of #10230
2018-10-18 09:56:20 +02:00
Zbigniew Jędrzejewski-Szmek c7adcb1af9 core: do not "warn" about mundane emergency actions
For example in a container we'd log:
Oct 17 17:01:10 rawhide systemd[1]: Started Power-Off.
Oct 17 17:01:10 rawhide systemd[1]: Forcibly powering off: unit succeeded
Oct 17 17:01:10 rawhide systemd[1]: Reached target Power-Off.
Oct 17 17:01:10 rawhide systemd[1]: Shutting down.
and on the console we'd write (in red)
[  !!  ] Forcibly powering off: unit succeeded

This is not useful in any way, and the fact that we're calling an "emergency action"
is an internal implementation detail. Let's log about c-a-d and the watchdog actions
only.
2018-10-17 19:32:09 +02:00
Zbigniew Jędrzejewski-Szmek 1710d4beff core: limit service-watchdogs=no to actual "watchdog" commands
The setting is now only looked at when considering an action for a job timeout
or unit start limit. It is ignored for ctrl-alt-del, SuccessAction, SuccessFailure.

v2: turn the parameter into a flag field
v3: rename Options to Flags
2018-10-17 19:31:50 +02:00
Lennart Poettering 93d4cb09d5 core: fix unfortunate typo in unit_is_unneeded()
Follow-up for a3c1168ac2.
2018-10-13 13:01:08 +02:00
Zbigniew Jędrzejewski-Szmek f436470ae1
Merge pull request #10343 from poettering/manager-state-fix
various fixes for PID1's Manager object
2018-10-10 12:36:16 +02:00
Lennart Poettering 3316429f19
Merge pull request #10062 from rgushchin/device
Support cgroup v2 bpf-based device controller
2018-10-09 23:29:27 +02:00
Lennart Poettering 5f616d5feb core: add missing 'continue' statement 2018-10-09 21:11:06 +02:00
Lennart Poettering 638cece45d core: clean up test run flags
Let's make them typesafe, and let's add a nice macro helper for checking
if we are in a test run, which should make testing for this much easier
to read for most cases.
2018-10-09 19:43:43 +02:00
Roman Gushchin 084c700780 core: support cgroup v2 device controller
Cgroup v2 provides the eBPF-based device controller, which isn't currently
supported by systemd. This commit aims to provide such support.

There are no user-visible changes, just the device policy and whitelist
start working if cgroup v2 is used.
2018-10-09 09:47:51 -07:00
Roman Gushchin 17f149556a core: refactor bpf firewall support into a pseudo-controller
The idea is to introduce a concept of bpf-based pseudo-controllers
to make adding new bpf-based features easier.
2018-10-09 09:46:08 -07:00
Lennart Poettering 0e699122b7 core: properly serialize "in_audit" per-unit boolean
Fixes: #9962
2018-10-09 10:09:39 +02:00
Lennart Poettering 256f65d045 core: rearrange conditions in unit_notify() a bit
This shouldn't change control flow, with one exception: we won't send
notifications for boot progress to plymouth anymore during reload, which
is something we really shouldn't.
2018-10-09 10:09:39 +02:00
Lennart Poettering 334415b16e
Merge pull request #10094 from keszybz/wants-loading
Fix bogus fragment paths in units in .wants/.requires
2018-10-05 17:36:31 +02:00
Anita Zhang c87700a133 Make Watchdog Signal Configurable
Allows configuring the watchdog signal (with a default of SIGABRT).
This allows an alternative to SIGABRT when coredumps are not desirable.

Appropriate references to SIGABRT or aborting were renamed to reflect
more liberal watchdog signals.

Closes #8658
2018-09-26 16:14:29 +02:00
Zbigniew Jędrzejewski-Szmek 23e8c79665 pid1: drop now-unused path parameter to resolve_template() 2018-09-15 20:03:32 +02:00
Zbigniew Jędrzejewski-Szmek 5a72417084 pid1: drop unused path parameter to add_two_dependencies_by_name() 2018-09-15 20:02:00 +02:00
Zbigniew Jędrzejewski-Szmek 35d8c19ace pid1: drop now-unused path parameter to add_dependency_by_name() 2018-09-15 19:57:52 +02:00
Zbigniew Jędrzejewski-Szmek fda09318e3 core: rename function to better reflect semantics 2018-08-20 10:43:31 +02:00
Lennart Poettering a3c1168ac2 core: rework StopWhenUnneeded= logic
Previously, we'd act immediately on StopWhenUnneeded= when a unit state
changes. With this rework we'll maintain a queue instead: whenever
there's the chance that StopWhenUneeded= might have an effect we enqueue
the unit, and process it later when we have nothing better to do.

This should make the implementation a bit more reliable, as the unit notify event
cannot immediately enqueue tons of side-effect jobs that might
contradict each other, but we do so only in a strictly ordered fashion,
from the main event loop.

This slightly changes the check when to consider a unit "unneeded".
Previously, we'd assume that a unit in "deactivating" state could also
be cleaned up. With this new logic we'll only consider units unneeded
that are fully up and have no job queued. This means that whenever
there's something pending for a unit we won't clean it up.
2018-08-10 16:19:01 +02:00
Yu Watanabe fe65e88ba6 namespace: implicitly adds DeviceAllow= when RootImage= is set
RootImage= may require the following settings
```
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rwm
DeviceAllow=block-blkext rwm
```
This adds the following settings implicitly when RootImage= is
specified.

Fixes #9737.
2018-08-06 14:02:31 +09:00
Jon Ringle fbb48d4c66 Make final kill signal configurable
Usecase is to allow changing the final kill from SIGKILL to SIGQUIT which
should create a core dump useful for debugging why the service didn't stop
with the SIGTERM
2018-07-23 13:44:54 +02:00
Chris Lamb 3fe910794b Correct a number of trivial typos. 2018-06-18 22:44:44 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering 6f40aa4547 core: add a couple of more error cases that should result in "bad-setting"
This changes a number of EINVAL cases to ENOEXEC, so that we enter
"bad-setting" state if they fail.
2018-06-11 12:53:12 +02:00
Lennart Poettering c4555ad8f6 core: introduce a new load state "bad-setting"
Since bb28e68477 parsing failures of
certain unit file settings will result in load failures of units. This
introduces a new load state "bad-setting" that is entered in precisely
this case.

With this addition error messages on bad settings should be a lot more
explicit, as we don't have to show some generic "errno" error in that
case, but can explicitly say that a bad setting is at fault.

Internally this unit load state is entered as soon as any configuration
loader call returns ENOEXEC. Hence: config parser calls should return
ENOEXEC now for such essential unit file settings. Turns out, they
generally already do.

Fixes: #9107
2018-06-11 12:53:12 +02:00
Lennart Poettering f0831ed2a0 core: add a new unit method "catchup()"
This is very similar to the existing unit method coldplug() but is
called a bit later. The idea is that that coldplug() restores the unit
state from before any prior reload/restart, i.e. puts the deserialized
state in effect. The catchup() call is then called a bit later, to
catch up with the system state for which we missed notifications while
we were reloading. This is only really useful for mount, swap and device
mount points were we should be careful to generate all missing unit
state change events (i.e. call unit_notify() appropriately) for
everything that happened while we were reloading.
2018-06-07 15:28:50 +02:00
Lennart Poettering 50be4f4a46 core: rework how we track service and scope PIDs
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.

This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.

Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.

All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.

Fixes: #9138
2018-06-05 22:06:48 +02:00
Zbigniew Jędrzejewski-Szmek 79e221d078
Merge pull request #9158 from poettering/notify-auto-reload
trigger OnFailure= only if Restart= is not in effect
2018-06-05 13:51:07 +02:00
Zbigniew Jędrzejewski-Szmek a1230ff972 basic/log: add the log_struct terminator to macro
This way all callers do not need to specify it.
Exhaustively tested by running test-log under valgrind ;)
2018-06-04 13:46:03 +02:00
Zbigniew Jędrzejewski-Szmek d94a24ca2e Add macro for checking if some flags are set
This way we don't need to repeat the argument twice.
I didn't replace all instances. I think it's better to leave out:
- asserts
- comparisons like x & y == x, which are mathematically equivalent, but
  here we aren't checking if flags are set, but if the argument fits in the
  flags.
2018-06-04 11:50:44 +02:00
Yu Watanabe 858d36c1ec path-util: introduce path_simplify()
The function is similar to path_kill_slashes() but also removes
initial './', trailing '/.', and '/./' in the path.
When the second argument of path_simplify() is false, then it
behaves as the same as path_kill_slashes(). Hence, this also
replaces path_kill_slashes() with path_simplify().
2018-06-03 23:39:26 +09:00
Lennart Poettering 2ad2e41a72 core: don't trigger OnFailure= deps when a unit is going to restart
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.

Note that this adjusts behaviour of systemd to match what the docs say.

Fixes: #8398
2018-06-01 19:08:30 +02:00
Lennart Poettering 7f66b026bb core: when we can't enqueue OnFailure= job show full error message
Let's ask for the full error message and show it, there's really no
reason to just show the crappy errno error.
2018-06-01 19:04:37 +02:00
Lennart Poettering 6f8fa29465
Merge pull request #8981 from keszybz/ratelimit-and-dbus
Ratelimit renaming and dbus error message fix
2018-05-18 21:38:30 +02:00
Felipe Sateler 57b7a260c2 core: undo the dependency inversion between unit.h and all unit types 2018-05-15 14:24:34 -04:00
Yu Watanabe af4fa99d6a core: use _cleanup_set_free_ instread of _cleanup_(set_freep) 2018-05-14 14:13:57 +09:00
Zbigniew Jędrzejewski-Szmek 7994ac1d85 Rename ratelimit_test to ratelimit_below
When I see "test", I have to think three times what the return value
means. With "below" this is immediately clear. ratelimit_below(&limit)
sounds almost like English and is imho immediately obvious.

(I also considered ratelimit_ok, but this strongly implies that being under the
limit is somehow better. Most of the times this is true, but then we use the
ratelimit to detect triple-c-a-d, and "ok" doesn't fit so well there.)

C.f. a1bcaa07.
2018-05-13 22:08:30 +02:00
David Tardon 95f14a3e21 core: use automatic cleanup more 2018-05-12 18:29:41 +02:00
Lennart Poettering d4fd1cf208 core: enforce that scope units can be started only once
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
2018-04-27 21:52:45 +02:00
Lennart Poettering 7a9a0c05d4
Merge pull request #8765 from poettering/test-fixes
some short fixes for the tests
2018-04-19 16:18:46 +02:00
Lennart Poettering 5d13a15b1d tree-wide: drop spurious newlines (#8764)
Double newlines (i.e. one empty lines) are great to structure code. But
let's avoid triple newlines (i.e. two empty lines), quadruple newlines,
quintuple newlines, …, that's just spurious whitespace.

It's an easy way to drop 121 lines of code, and keeps the coding style
of our sources a bit tigther.
2018-04-19 12:13:23 +02:00
Lennart Poettering 8f63253149 core: don't export per-unit metadata files in test mode
We shouldn't clobber the host's /run directories with metadata we export
for our units when we run in test mode.
2018-04-19 11:30:18 +02:00
Lennart Poettering 4d09e1c8ba
Merge pull request #8676 from keszybz/drop-license-boilerplate
Drop license boilerplate
2018-04-10 14:53:31 +02:00
Zbigniew Jędrzejewski-Szmek e9e8cbc83a core: minor comment update 2018-04-07 20:05:58 +02:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Yu Watanabe 1cc6c93a95 tree-wide: use TAKE_PTR() and TAKE_FD() macros 2018-04-05 14:26:26 +09:00
Michal Sekletar 19496554e2 core: delay adding target dependencies until all units are loaded and aliases resolved (#8381)
Currently we add target dependencies while we are loading units. This
can create ordering loops even if configuration doesn't contain any
loop. Take for example following configuration,

$ systemctl get-default
multi-user.target

$ cat /etc/systemd/system/test.service
[Unit]
After=default.target

[Service]
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

If we encounter such unit file early during manager start-up (e.g. load
queue is dispatched while enumerating devices due to SYSTEMD_WANTS in
udev rules) we would add stub unit default.target and we order it Before
test.service. At the same time we add implicit Before to
multi-user.target. Later we merge two units and we create ordering cycle
in the process.

To fix the issue we will now never add any target dependencies until we
loaded all the unit files and resolved all the aliases.
2018-03-23 15:28:06 +01:00
Lennart Poettering ae2a15bc14 macro: introduce TAKE_PTR() macro
This macro will read a pointer of any type, return it, and set the
pointer to NULL. This is useful as an explicit concept of passing
ownership of a memory area between pointers.

This takes inspiration from Rust:

https://doc.rust-lang.org/std/option/enum.Option.html#method.take

and was suggested by Alan Jenkins (@sourcejedi).

It drops ~160 lines of code from our codebase, which makes me like it.
Also, I think it clarifies passing of ownership, and thus helps
readability a bit (at least for the initiated who know the new macro)
2018-03-22 20:21:42 +01:00
Lennart Poettering 31dc1ca3bf move MANAGER_IS_RELOADING() check into manager_recheck_{dbus|journal}() (#8510)
Let's better check this inside of the call than before it, so that we
never issue this while reloading, even should these calls be called due
to other reasons than just the unit notify.

This makes sure the reload state is unset a bit earlier in
manager_reload() so that we can safely call this function from there and
they do the right thing.

Follow-up for e63ebf71ed.
2018-03-21 12:03:45 +01:00
Evgeny Vereshchagin e4711004d6
Merge pull request #8461 from keszybz/oss-fuzz-fixes
Oss fuzz fixes
2018-03-19 00:06:44 +03:00
Zbigniew Jędrzejewski-Szmek ca8700e922 core/unit: delay creating a stack variable until after length has been checked
path_is_normalized() will reject paths longer than 4095 bytes, so it's better
to not create a stack variable of unbounded size, but instead do the check first
and only then do that allocation.

Also use _cleanup_ to make things a bit shorter.

https://oss-fuzz.com/v2/issue/5424177403133952/7000
2018-03-18 21:07:01 +01:00
Zbigniew Jędrzejewski-Szmek e63ebf71ed core: when reloading, delay any actions on journal and dbus connections
manager_recheck_journal() and manager_recheck_dbus() would be called to early
while we were deserialiazing units, before the systemd-journald.service and
dbus.service have been deserialized. In effect we'd disable logging to the
journald and close the bus connection. The first is not very noticable, it
mostly means that logs emitted during deserialization are lost. The second is
more noticeable, because manager_recheck_dbus() would call bus_done_api() and
bus_done_system() and close dbus connections. Logging and bus connection would
then be restored later after the respective units have been deserialized.

This is easily reproduced by calling:
  $ sudo gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1 --method "org.freedesktop.systemd1.Manager.Reload"
which works fine before 8559b3b75c, and then starts failing with:
  Error: GDBus.Error:org.freedesktop.DBus.Error.NoReply: Remote peer disconnected

None of this should happen, and we should delay changing state until after
deserialization is complete when reloading. manager_reload() already included
the calls to manager_recheck_journal() and manager_recheck_dbus(), so the
connection state will be updated after deserialization during reloading is done.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1554578.
2018-03-16 23:14:04 +01:00
Zbigniew Jędrzejewski-Szmek dc409696cf Introduce _cleanup_(unit_freep) 2018-03-11 16:33:58 +01:00
Zbigniew Jędrzejewski-Szmek bea28c5adb core/unit: voidify one snprintf statement
One more follow-up for f810b631cd.
2018-02-26 15:49:27 +01:00
Zbigniew Jędrzejewski-Szmek f810b631cd Revert "Replace use of snprintf with xsprintf"
This reverts commit a7419dbc59.

_All_ changes in that commit were wrong.

Fixes #8211.
2018-02-23 00:13:52 +01:00
Lennart Poettering aa2b6f1d2b bpf: rework how we keep track and attach cgroup bpf programs
So, the kernel's management of cgroup/BPF programs is a bit misdesigned:
if you attach a BPF program to a cgroup and close the fd for it it will
stay pinned to the cgroup with no chance of ever removing it again (or
otherwise getting ahold of it again), because the fd is used for
selecting which BPF program to detach. The only way to get rid of the
program again is to destroy the cgroup itself.

This is particularly bad for root the cgroup (and in fact any other
cgroup that we cannot realistically remove during runtime, such as
/system.slice, /init.scope or /system.slice/dbus.service) as getting rid
of the program only works by rebooting the system.

To counter this let's closely keep track to which cgroup a BPF program
is attached and let's implicitly detach the BPF program when we are
about to close the BPF fd.

This hence changes the bpf_program_cgroup_attach() function to track
where we attached the program and changes bpf_program_cgroup_detach() to
use this information. Moreover bpf_program_unref() will now implicitly
call bpf_program_cgroup_detach().

In order to simplify things, bpf_program_cgroup_attach() will now
implicitly invoke bpf_program_load_kernel() when necessary, simplifying
the caller's side.

Finally, this adds proper reference counting to BPF programs. This
is useful for working with two BPF programs in parallel: the BPF program
we are preparing for installation and the BPF program we so far
installed, shortening the window when we detach the old one and reattach
the new one.
2018-02-21 16:43:36 +01:00
Lennart Poettering 00f5ad93b5 core: change KeyringMode= to "shared" by default for non-service units in the system manager (#8172)
Before this change all unit types would default to "private" in the
system service manager and "inherit" to in the user service manager.

With this change this is slightly altered: non-service units of the
system service manager are now run with KeyringMode=shared. This appears
to be the more appropriate choice as isolation is not as desirable for
mount tools, which regularly consume key material. After all mounts are
a shared resource themselves as they appear system-wide hence it makes a
lot of sense to share their key material too.

Fixes: #8159
2018-02-20 08:53:34 +01:00
Lennart Poettering 30663b6c25
Merge pull request #8199 from keszybz/small-things
Sundry small cleanups
2018-02-19 16:55:10 +01:00
Zbigniew Jędrzejewski-Szmek f4aa0bde1c core: drop obsolete comment
https://github.com/systemd/systemd/pull/8125#pullrequestreview-96894581
2018-02-19 15:18:54 +01:00
Lennart Poettering a94ab7acfd
Merge pull request #8175 from keszybz/gc-cleanup
Garbage collection cleanup
2018-02-15 17:47:37 +01:00
Zbigniew Jędrzejewski-Szmek 648461c07d Merge pull request #8125 from poettering/cgroups-migrate
Trivial merge conflict resolved locally.
2018-02-15 16:15:45 +01:00
Zbigniew Jędrzejewski-Szmek 1bdf279002 pid1: properly remove references to the unit from gc queue during final cleanup
When various references to the unit were dropped during cleanup in unit_free(),
add_to_gc_queue() could be called on this unit. If the unit was previously in
the gc queue (at the time when unit_free() was called on it), this wouldn't
matter, because it'd have in_gc_queue still set even though it was already
removed from the queue. But if it wasn't set, then the unit could be added to
the queue. Then after unit_free() would deallocate the unit, we would be left
with a dangling pointer in gc_queue.

A unit could be added to the gc queue in two places called from unit_free():
in the job_install calls, and in unit_ref_unset(). The first was OK, because
it was above the LIST_REMOVE(gc_queue,...) call, but the second was not, because
it was after that. Move the all LIST_REMOVE() calls down.
2018-02-15 14:03:53 +01:00
Zbigniew Jędrzejewski-Szmek a946fa9bb9 pid1: free basic unit information at the very end, before freeing the unit
We would free stuff like the names of the unit first, and then recurse
into other structures to remove the unit from there. Technically this
was OK, since the code did not access the name, but this makes debugging
harder. And if any log messages are added in any of those functions, they
are likely to access u->id and such other basic information about the unit.
So let's move the removal of this "basic" information towards the end
of unit_free().
2018-02-15 13:32:59 +01:00
Zbigniew Jędrzejewski-Szmek 2641f02e23 pid1: fix collection of cycles of units which reference one another
A .socket will reference a .service unit, by registering a UnitRef with the
.service unit. If this .service unit has the .socket unit listed in Wants or
Sockets or such, a cycle will be created. We would not free this cycle
properly, because we treated any unit with non-empty refs as uncollectable. To
solve this issue, treats refs with UnitRef in u->refs_by_target similarly to
the refs in u->dependencies, and check if the "other" unit is known to be
needed. If it is not needed, do not treat the reference from it as preventing
the unit we are looking at from being freed.
2018-02-15 13:32:53 +01:00
Zbigniew Jędrzejewski-Szmek 7f7d01ed58 pid1: include the source unit in UnitRef
No functional change.

The source unit manages the reference. It allocates the UnitRef structure and
registers it in the target unit, and then the reference must be destroyed
before the source unit is destroyed. Thus, is should be OK to include the
pointer to the source unit, it should be live as long as the reference exists.

v2:
- rename refs to refs_by_target
2018-02-15 13:27:06 +01:00
Zbigniew Jędrzejewski-Szmek f2f725e5cc pid1: rename unit_check_gc to unit_may_gc
"check" is unclear: what is true, what is false? Let's rename to "can_gc" and
revert the return value ("positive" values are easier to grok).

v2:
- rename from unit_can_gc to unit_may_gc
2018-02-15 13:04:12 +01:00
Lennart Poettering 6592b9759c core: add new new bus call for migrating foreign processes to scope/service units
This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.

The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.

The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.

The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have now User= field).

Fixes: #3388
2018-02-12 11:34:00 +01:00
Lennart Poettering 8559b3b75c core: rework how we connect to the bus
This removes the current bus_init() call, as it had multiple problems:
it munged  handling of the three bus connections we care about (private,
"api" and system) into one, even though the conditions when which was
ready are very different. It also added redundant logging, as the
individual calls it called all logged on their own anyway.

The three calls bus_init_api(), bus_init_private() and bus_init_system()
are now made public. A new call manager_dbus_is_running() is added that
works much like manager_journal_is_running() and is a lot more careful
when checking whether dbus is around. Optionally it checks the unit's
deserialized_state rather than state, in order to accomodate for cases
where we cant to connect to the bus before deserializing the
"subscribed" list, before coldplugging the units.

manager_recheck_dbus() is added, that works a lot like
manager_recheck_journal() and is invoked in unit_notify(), i.e. when
units change state.

All in all this should make handling a bit more alike to journal
handling, and it also fixes one major bug: when running in user mode
we'll now connect to the system bus early on, without conditionalizing
this in anyway.
2018-02-12 11:34:00 +01:00
Lennart Poettering 004c7f169e core: fold manager_set_exec_params() into unit_set_exec_params()
Let's simplify things a bit: we so far called both functions every
single time, let's just merge one into the other, so that we have fewer
functions to call.
2018-02-12 11:34:00 +01:00
Lennart Poettering 1d9cc8768f cgroup: add a new "can_delegate" flag to the unit vtable, and set it for scope and service units only
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.

Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.

Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
2018-02-12 11:34:00 +01:00
Lennart Poettering 548f69375e tree-wide: use path_hash_ops instead of string_hash_ops whenever we key by a path
Let's make use of our new hash_ops!
2018-02-12 11:07:55 +01:00
Franck Bui 9ea3a0e702 core: use id unit when retrieving unit file state (#8038)
Previous code was using the basename(id->fragment_path) which returned
incorrect result if the unit was an instance.

For example, assuming that no instances of "template" have been created so far:

 $ systemctl enable template@1
 Created symlink from /etc/systemd/system/multi-user.target.wants/template@1.service to /usr/lib/systemd/system/template@.service.

 $ systemctl is-enabled template@3.service
 disabled

 $ systemctl status template@3.servicetemplate@3.service - openQA Worker #3
    Loaded: loaded (/usr/lib/systemd/system/template@.service; enabled; vendor preset: disabled)
    [...]

Here the unit file states reported by "status" and "is-enabled" were different.
2018-02-07 14:08:02 +01:00
Andrei Gherzan 3f602115b7 core: Avoid empty directory warning when we are bind-mounting a file (#8069) 2018-02-06 16:35:52 +01:00
Yu Watanabe e8a565cb66 core: make ExecRuntime be manager managed object
Before this, each ExecRuntime object is owned by a unit. However,
it may be shared with other units which enable JoinsNamespaceOf=.
Thus, by the serialization/deserialization process, its sharing
information, more specifically, reference counter is lost, and
causes issue #7790.

This makes ExecRuntime objects be managed by manager, and changes
the serialization/deserialization process.

Fixes #7790.
2018-02-06 16:00:34 +09:00
Lennart Poettering 81e9871e87 selinux: make sure we never use /dev/null for making unit selinux access decisions 2018-01-31 19:54:25 +01:00
Lennart Poettering adefcf2821 core: rework how we count the n_on_console counter
Let's add a per-unit boolean that tells us whether our unit is currently
counted or not. This way it's unlikely we get out of sync again and
things are generally more robust.

This also allows us to remove the counting logic specific to service
units (which was in fact mostly a copy from the generic implementation),
in favour of fully generic code.

Replaces: #7824
2018-01-24 20:14:51 +01:00
Lennart Poettering bb2c768545 core: add a new unit_needs_console() call
This call determines whether a specific unit currently needs access to
the console. It's a fancy wrapper around
exec_context_may_touch_console() ultimately, however for service units
we'll explicitly exclude the SERVICE_EXITED state from when we report
true.
2018-01-24 19:54:26 +01:00
Lennart Poettering 62a769136d core: rework how we track which PIDs to watch for a unit
Previously, we'd maintain two hashmaps keyed by PIDs, pointing to Unit
interested in SIGCHLD events for them. This scheme allowed a specific
PID to be watched by exactly 0, 1 or 2 units.

With this rework this is replaced by a single hashmap which is primarily
keyed by the PID and points to a Unit interested in it. However, it
optionally also keyed by the negated PID, in which case it points to a
NULL terminated array of additional Unit objects also interested. This
scheme means arbitrary numbers of Units may now watch the same PID.

Runtime and memory behaviour should not be impact by this change, as for
the common case (i.e. each PID only watched by a single unit) behaviour
stays the same, but for the uncommon case (a PID watched by more than
one unit) we only pay with a single additional memory allocation for the
array.

Why this all? Primarily, because allowing exactly two units to watch a
specific PID is not sufficient for some niche cases, as processes can
belong to more than one unit these days:

1. sd_notify() with MAINPID= can be used to attach a process from a
   different cgroup to multiple units.

2. Similar, the PIDFile= setting in unit files can be used for similar
   setups,

3. By creating a scope unit a main process of a service may join a
   different unit, too.

4. On cgroupsv1 we frequently end up watching all processes remaining in
   a scope, and if a process opens lots of scopes one after the other it
   might thus end up being watch by many of them.

This patch hence removes the 2-unit-per-PID limit. It also makes a
couple of other changes, some of them quite relevant:

- manager_get_unit_by_pid() (and the bus call wrapping it) when there's
  ambiguity will prefer returning the Unit the process belongs to based on
  cgroup membership, and only check the watch-pids hashmap if that
  fails. This change in logic is probably more in line with what people
  expect and makes things more stable as each process can belong to
  exactly one cgroup only.

- Every SIGCHLD event is now dispatched to all units interested in its
  PID. Previously, there was some magic conditionalization: the SIGCHLD
  would only be dispatched to the unit if it was only interested in a
  single PID only, or the PID belonged to the control or main PID or we
  didn't dispatch a signle SIGCHLD to the unit in the current event loop
  iteration yet. These rules were quite arbitrary and also redundant as
  the the per-unit handlers would filter the PIDs anyway a second time.
  With this change we'll hence relax the rules: all we do now is
  dispatch every SIGCHLD event exactly once to each unit interested in
  it, and it's up to the unit to then use or ignore this. We use a
  generation counter in the unit to ensure that we only invoke the unit
  handler once for each event, protecting us from confusion if a unit is
  both associated with a specific PID through cgroup membership and
  through the "watch_pids" logic. It also protects us from being
  confused if the "watch_pids" hashmap is altered while we are
  dispatching to it (which is a very likely case).

- sd_notify() message dispatching has been reworked to be very similar
  to SIGCHLD handling now. A generation counter is used for dispatching
  as well.

This also adds a new test that validates that "watch_pid" registration
and unregstration works correctly.
2018-01-23 21:29:31 +01:00
Alan Jenkins 25cd49647c mount: forbid mount on path with symlinks
It was forbidden to create mount units for a symlink.  But the reason is
that the mount unit needs to know the real path that will appear in
/proc/self/mountinfo.  The kernel dereferences *all* the symlinks in the
path at mount time (I checked this with `mount -c` running under `strace`).

This will have no effect on most systems.  As recommended by docs, most
systems use /etc/fstab, as opposed to native mount unit files.
fstab-generator dereferences symlinks for backwards compatibility.

A relatively minor issue regarding Time Of Check / Time Of Use also exists
here.  I can't see how to get rid of it entirely.  If we pass an absolute
path to mount, the racing process can replace it with a symlink.  If we
chdir() to the mount point and pass ".", the racing process can move the
directory.  The latter might potentially be nicer, except that it breaks
WorkingDirectory=.

I'm not saying the race is relevant to security - I just want to consider
how bad the effect is.  Currently, it can make the mount unit active (and
hence the job return success), despite there never being a matching entry
in /proc/self/mountinfo.  This wart will be removed in the next commit;
i.e. it will make the mount unit fail instead.
2018-01-20 22:06:34 +00:00
Lennart Poettering 75152a4d6a tree-wide: install matches asynchronously
Let's remove a number of synchronization points from our service
startups: let's drop synchronous match installation, and let's opt for
asynchronous instead.

Also, let's use sd_bus_match_signal() instead of sd_bus_add_match()
where we can.
2018-01-05 13:58:32 +01:00
Lennart Poettering 4c253ed1ca tree-wide: introduce new safe_fork() helper and port everything over
This adds a new safe_fork() wrapper around fork() and makes use of it
everywhere. The new wrapper does a couple of things we previously did
manually and separately in a safer, more correct and automatic way:

1. Optionally resets signal handlers/mask in the child

2. Sets a name on all processes we fork off right after forking off (and
   the patch assigns useful names for all processes we fork off now,
   following a systematic naming scheme: always enclosed in () – in order
   to indicate that these are not proper, exec()ed processes, but only
   forked off children, and if the process is long-running with only our
   own code, without execve()'ing something else, it gets am "sd-" prefix.)

3. Optionally closes all file descriptors in the child

4. Optionally sets a PR_SET_DEATHSIG to SIGTERM in the child, in a safe
   way so that the parent dying before this happens being handled
   safely.

5. Optionally reopens the logs

6. Optionally connects stdin/stdout/stderr to /dev/null

7. Debug logs about the forked off processes.
2017-12-25 11:48:21 +01:00
Lennart Poettering a8ea93a5e2 core: use empty_to_null() where we can 2017-12-07 12:13:00 +01:00
Michal Koutný deb4e7080d service: Don't stop unneeded units needed by restarted service (#7526)
An auto-restarted unit B may depend on unit A with StopWhenUnneeded=yes.
If A stops before B's restart timeout expires, it'll be started again as part
of B's dependent jobs. However, if stopping takes longer than the timeout, B's
running stop job collides start job which also cancels B's start job. Result is
that neither A or B are active.

Currently, when a service with automatic restarting fails, it transitions
through following states:
        1) SERVICE_FAILED or SERVICE_DEAD to indicate the failure,
        2) SERVICE_AUTO_RESTART while restart timer is running.

The StopWhenUnneeded= check takes place in service_enter_dead between the two
state mentioned above. We temporarily store the auto restart flag to query it
during the check. Because we don't return control to the main event loop, this
new service unit flag needn't be serialized.

This patch prevents the pathologic situation when the service with Restart=
won't restart automatically. As a side effect it also avoid restarting the
dependency unit with StopWhenUnneeded=yes.

Fixes: #7377
2017-12-05 16:51:19 +01:00
Lennart Poettering 50fb00b707 core: use safe_fclose() where we can 2017-11-29 12:34:12 +01:00
Lennart Poettering 45639f1be5 core: never remove "transient" and "control" directories from unit search path
This changes the unit search path logic to never drop the transient and
control directories from the unit search path. This is necessary as we
add new entries to both during runtime, due to the "systemctl
set-property" and transient unit logic.

Previously, the "transient" directory was created during early boot to
deal with this, but the "control" directories were not covered like
that. Creating the control directories early at boot is not possible
however, as /etc might be read-only then, and we do define a persistent
control directory. Hence, let's create these dirs on-demand when we need
them, and make sure the search path clean-up logic never drops them from
the search path even if they are initially missing.

(Also, always create these paths properly labelled)
2017-11-29 12:34:12 +01:00
Lennart Poettering 0126c8f3f6 core: minor simplification 2017-11-29 12:34:12 +01:00
Lennart Poettering 2e59b241ca core: add proper escaping to writing of drop-ins/transient unit files
This majorly refactors the transient unit file and drop-in writing
logic, so that we properly C-escape and specifier-escape (% → %%)
everything we write out, so that when we read it back again, specifiers
are parsed that aren't supposed to be parsed.

This renames unit_write_drop_in() and friends by unit_write_setting().
The name change is supposed to clarify that the functions are not only
used to write drop-in files, but also transient unit files.

The previous "mode" parameter to this function is replaced by a more
generic "flags", which knows additional flags for implicit C-style and
specifier escaping before writing things out. This can cover most
properties where either form of escaping is defined. For the cases where
this isn't sufficient, we add helpers unit_escape_setting() and
unit_concat_strv() for escaping individual strings or strvs properly.

While we are at it, we also prettify generation of transient unit files:
we try to reduce the number of section headers written out: previously
we'd write the right section header our for each setting. With this
change we do so only if the setting lives in a different section than
the one before.

(This should also be considered preparation for when we add proper APIs
to systemd to write normal, persistant unit files through the bus API)
2017-11-29 12:34:12 +01:00
Lennart Poettering a4634b214c core: warn about left-over processes in cgroup on unit start
Now that we don't kill control processes anymore, let's at least warn
about any processes left-over in the unit cgroup at the moment of
starting the unit.
2017-11-25 17:08:21 +01:00
Lennart Poettering e98b2fbbe9 core: generalize the cgroup empty check on GC
Let's move the cgroup empty check for all unit types into the generic
unit_check_gc() call, out of the per-unit-type _check_gc() type. This
not only allows us to share some code, but also hooks up mount and
socket units with this kind of check, for free, as it was missing there
previously.
2017-11-25 17:08:21 +01:00
Lennart Poettering 60c728adf7 unit: initialize bpf cgroup realization state properly
Before this patch, the bpf cgroup realization state was implicitly set
to "NO", meaning that the bpf configuration was realized but was turned
off. That means invalidation requests for the bpf stuff (which we issue
in blanket fashion when doing a daemon reload) would actually later
result in a us re-realizing the unit, under the assumption it was
already realized once, even though in reality it never was realized
before.

This had the effect that after each daemon-reload we'd end up realizing
*all* defined units, even the unloaded ones, populating cgroupfs with
lots of unneeded empty cgroups.

With this fix we properly set the realiazation state to "INVALIDATED",
i.e. indicating the bpf stuff was never set up for the unit, and hence
when we try to invalidate it later we won't do anything.
2017-11-25 17:08:21 +01:00
Daniel Lockyer a7419dbc59 Replace use of snprintf with xsprintf 2017-11-24 10:36:04 +00:00
Zbigniew Jędrzejewski-Szmek ffb70e4424
Merge pull request #7381 from poettering/cgroup-unified-delegate-rework
Fix delegation in the unified hierarchy + more cgroup work
2017-11-22 07:42:08 +01:00
Lennart Poettering 3c7416b6ca core: unify common code for preparing for forking off unit processes
This introduces a new function unit_prepare_exec() that encapsulates a
number of calls we do in preparation for spawning off some processes in
all our unit types that do so.

This allows us to neatly unify a bit of code between unit types and
shorten our code.
2017-11-21 11:54:08 +01:00
Lennart Poettering e7dfbb4e74 core: introduce SuccessAction= as unit file property
SuccessAction= is similar to FailureAction= but declares what to do on
success of a unit, rather than on failure. This is useful for running
commands in qemu/nspawn images, that shall power down on completion. We
frequently see "ExecStopPost=/usr/bin/systemctl poweroff" or so in unit
files like this. Offer a simple, more declarative alternative for this.

While we are at it, hook up failure action with unit_dump() and
transient units too.
2017-11-20 16:37:22 +01:00
Lennart Poettering 53c35a766f core: generalize FailureAction= move it from service to unit
All kinds of units can fail, hence it makes sense to offer this as
generic concept for all unit types.
2017-11-20 16:37:22 +01:00
Lennart Poettering 0133d5553a
Merge pull request #7198 from poettering/stdin-stdout
Add StandardInput=data, StandardInput=file:... and more
2017-11-19 19:49:11 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 99be45a46f fs-util: rename path_is_safe() → path_is_normalized()
Already, path_is_safe() refused paths container the "." dir. Doing that
isn't strictly necessary to be "safe" by most definitions of the word.
But it is necessary in order to consider a path "normalized". Hence,
"path_is_safe()" is slightly misleading a name, but
"path_is_normalize()" is more descriptive, hence let's rename things
accordingly.

No functional changes.
2017-11-17 11:13:44 +01:00
Lennart Poettering 5afe510c89 core: add a new unit file setting CollectMode= for tweaking the GC logic
Right now, the option only takes one of two possible values "inactive"
or "inactive-or-failed", the former being the default, and exposing same
behaviour as the status quo ante. If set to "inactive-or-failed" units
may be collected by the GC logic when in the "failed" state too.

This logic should be a nicer alternative to using the "-" modifier for
ExecStart= and friends, as the exit data is collected and logged about
and only removed when the GC comes along. This should be useful in
particular for per-connection socket-activated services, as well as
"systemd-run" command lines that shall leave no artifacts in the
system.

I was thinking about whether to expose this as a boolean, but opted for
an enum instead, as I have the suspicion other tweaks like this might be
a added later on, in which case we extend this setting instead of having
to add yet another one.

Also, let's add some documentation for the GC logic.
2017-11-16 14:38:36 +01:00
Lennart Poettering 7eb2a8a125 unit: rework a bit how we keep the service fdstore from being destroyed during service restart
When preparing for a restart we quickly go through the DEAD/INACTIVE
service state before entering AUTO_RESTART. When doing this, we need to
make sure we don't destroy the FD store. Previously this was done by
checking the failure state of the unit, and keeping the FD store around
when the unit failed, under the assumption that the restart logic will
then get into action.

This is not entirely correct howver, as there might be failure states
that will no result in restarts.

With this commit we slightly alter the logic: a ref counter for the fd
store is added, that is increased right before we handle the restart
logic, and decreased again right-after.

This should ensure that the fdstore lives exactly as long as it needs.

Follow-up for f0bfbfac43.
2017-11-16 14:37:33 +01:00