Commit graph

350 commits

Author SHA1 Message Date
Lennart Poettering 4b58153dd2 core: add "invocation ID" concept to service manager
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.

The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.

The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.

The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.

Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.

This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().

PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.

A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.

Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
2016-10-07 20:14:38 +02:00
Zbigniew Jędrzejewski-Szmek dd5e7000cb core: complain if Before= dep on .device is declared
[Unit]
Before=foobar.device

[Service]
ExecStart=/bin/true
Type=oneshot

$ systemd-analyze verify before-device.service
before-device.service: Dependency Before=foobar.device ignored (.device units cannot be delayed)
2016-10-01 22:53:17 +02:00
Lennart Poettering 63bb64a056 core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.

This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
2016-09-25 10:42:18 +02:00
Lennart Poettering 390bc2b149 core: let's use set_contains() where appropriate 2016-08-22 16:14:21 +02:00
Lennart Poettering fe700f46ec core: cache last CPU usage counter, before destorying a cgroup
It is useful for clients to be able to read the last CPU usage counter value of
a unit even if the unit is already terminated. Hence, before destroying a
cgroup's cgroup cache the last CPU usage counter and return it if the cgroup is
gone.
2016-08-22 16:14:21 +02:00
Lennart Poettering 05a98afd3e core: add Ref()/Unref() bus calls for units
This adds two (privileged) bus calls Ref() and Unref() to the Unit interface.
The two calls may be used by clients to pin a unit into memory, so that various
runtime properties aren't flushed out by the automatic GC. This is necessary
to permit clients to race-freely acquire runtime results (such as process exit
status/code or accumulated CPU time) on successful service termination.

Ref() and Unref() are fully recursive, hence act like the usual reference
counting concept in C. Taking a reference is a privileged operation, as this
allows pinning units into memory which consumes resources.

Transient units may also gain a reference at the time of creation, via the new
AddRef property (that is only defined for transient units at the time of
creation).
2016-08-22 16:14:21 +02:00
Zbigniew Jędrzejewski-Szmek 2056ec1927 Merge pull request #3965 from htejun/systemd-controller-on-unified 2016-08-19 19:58:01 -04:00
Lennart Poettering 00d9ef8560 core: add RemoveIPC= setting
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e.  all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.

This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.

In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
2016-08-19 00:37:25 +02:00
Tejun Heo 5da38d0768 core: use the unified hierarchy for the systemd cgroup controller hierarchy
Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management.  Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies.  It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logics and would allow
using features which depend on the unified hierarchy membership such bpf cgroup
v2 membership test.  In time, this would also allow deleting the said
complexities.

This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.

* cg_unified(@controller) is introduced which tests whether the specific
  controller in on unified hierarchy and used to choose the unified hierarchy
  code path for process and service management when available.  Kernel
  controller specific operations remain gated by cg_all_unified().

* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
  force the use of legacy hierarchy for systemd cgroup controller.

* nspawn: By default nspawn uses the same hierarchies as the host.  If
  UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all.  If
  0, legacy for all.

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
  three options - legacy, only systemd controller on unified, and unified.  The
  value is passed into mount setup functions and controls cgroup configuration.

* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
  option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
  appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open coded integer values to indicate the
      cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy container on unified host support and fixed another bug in
    detect_unified_cgroup_hierarchy().
2016-08-17 17:44:36 -04:00
Tejun Heo ca2f6384aa core: rename cg_unified() to cg_all_unified()
A following patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies.  This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is.  In preparation, this patch renames cg_unified() to
cg_all_unified().

This patch doesn't cause any functional changes.
2016-08-15 18:13:36 -04:00
Tejun Heo 66ebf6c0a1 core: add cgroup CPU controller support on the unified hierarchy
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches.  The situation is explained in
the following documentation.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu

While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads.  This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.

On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us".  [Startup]CPUWeight config
options are added with the usual compat translation.  CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.

v2: - Error in man page corrected.
    - CPU config application in cgroup_context_apply() refactored.
    - CPU accounting now works on unified hierarchy.
2016-08-07 09:45:39 -04:00
Lennart Poettering 29206d4619 core: add a concept of "dynamic" user ids, that are allocated as long as a service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).

For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.

A simple way to test this is:

        systemd-run -p DynamicUser=1 /bin/sleep 99999
2016-07-22 15:53:45 +02:00
Lennart Poettering 1d98fef17d core: when forcibly killing/aborting left-over unit processes log about it
Let's lot at LOG_NOTICE about any processes that we are going to
SIGKILL/SIGABRT because clean termination of them didn't work.

This turns the various boolean flag parameters to cg_kill(), cg_migrate() and
related calls into a single binary flags parameter, simply because the function
now gained even more parameters and the parameter listed shouldn't get too
long.

Logging for killing processes is done either when the kill signal is SIGABRT or
SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE
is passed. This isn't used yet in this patch, but is made use of in a later
patch.
2016-07-20 14:35:15 +02:00
Michael Biebl 595bfe7df2 Various fixes for typos found by lintian (#3705) 2016-07-12 12:52:11 +02:00
Torstein Husebø 61233823aa treewide: fix typos and remove accidental repetition of words 2016-07-11 16:18:43 +02:00
David Michael 4f952a3f07 core: queue loading transient units after setting their properties (#3676)
The unit load queue can be processed in the middle of setting the
unit's properties, so its load_state would no longer be UNIT_STUB
for the check in bus_unit_set_properties(), which would cause it to
incorrectly return an error.
2016-07-08 05:43:01 +02:00
Kyle Walker 36f20ae3b2 manager: Only invoke a single sigchld per unit within a cleanup cycle
By default, each iteration of manager_dispatch_sigchld() results in a unit level
sigchld event being invoked. For scope units, this results in a scope_sigchld_event()
which can seemingly stall for workloads that have a large number of PIDs within the
scope. The stall exhibits itself as a SIG_0 being initiated for each u->pids entry
as a result of pid_is_unwaited().

v2:
This patch resolves this condition by only paying to cost of a sigchld in the underlying
scope unit once per sigchld iteration. A new "sigchldgen" member resides within the
Unit struct. The Manager is incremented via the sd event loop, accessed via
sd_event_get_iteration, and the Unit member is set to the same value as the manager each
time that a sigchld event is invoked. If the Manager iteration value and Unit member
match, the sigchld event is not invoked for that iteration.
2016-06-30 15:16:47 -04:00
Lennart Poettering fc40065bcd core: when writing transient unit files, make sure all lines end with a newline
This is a fix-up for 2a9a6f8ac0 which covered
non-transient units, but missed the case for transient units.
2016-06-23 01:29:33 +02:00
Lennart Poettering 3f71dec5d7 unit: properly comment generated comments in unit files
Fix-up for 2a9a6f8ac0
2016-06-14 20:01:45 +02:00
Zbigniew Jędrzejewski-Szmek 2a9a6f8ac0 core/unit: append newline when writing drop ins
unit_write_drop_in{,_private}{,_format} are all affected.

We already append a header to the file (and section markers), so those functions
can only be used to write a whole file at once. Including the newline at
the end feels natural.

After this commit newlines will be duplicated. They will be removed in
subsequent commit.

Also, rewrap the "autogenerated" header to fit within 80 columns.
2016-05-28 16:17:54 -04:00
Lennart Poettering 3103459e90 Merge pull request #3193 from htejun/cgroup-io-controller
core: add io controller support on the unified hierarchy
2016-05-16 22:05:27 +02:00
Michal Sekletar 833f92ad39 core: don't log job status message in case job was effectively NOP (#3199)
We currently generate log message about unit being started even when
unit was started already and job didn't do anything. This is because job
was requested explicitly and hence became anchor job of the transaction
thus we could not eliminate it. That is fine but, let's not pollute
journal with useless log messages.

$ systemctl start systemd-resolved
$ systemctl start systemd-resolved
$ systemctl start systemd-resolved

Current state:
$ journalctl -u systemd-resolved | grep Started

May 05 15:31:42 rawhide systemd[1]: Started Network Name Resolution.
May 05 15:31:59 rawhide systemd[1]: Started Network Name Resolution.
May 05 15:32:01 rawhide systemd[1]: Started Network Name Resolution.

After patch applied:
$ journalctl -u systemd-resolved | grep Started

May 05 16:42:12 rawhide systemd[1]: Started Network Name Resolution.

Fixes #1723
2016-05-16 11:24:51 -04:00
Tejun Heo 99e66921c8 core: allow slice to be overriden if cgroups aren't realized (#3246)
unit_set_slice() fails with -EBUSY if the unit already has a slice associated
with it.  This makes it impossible to override slice through dropin config or
over dbus.  There's no reason to disallow slice changes as long as cgroups
aren't realized.  Fix it.

Fixes #3240.

Signed-off-by: Tejun Heo <htejun@fb.com>
Reported-by: Davide Cavalca <dcavalca@fb.com>
2016-05-14 15:56:53 -04:00
Lennart Poettering f76707da45 core: update the right mtime after finishing writing of transient units (#3203)
Fixes: #3194
2016-05-06 19:22:22 +03:00
Tejun Heo 13c31542cc core: add io controller support on the unified hierarchy
On the unified hierarchy, blkio controller is renamed to io and the interface
is changed significantly.

* blkio.weight and blkio.weight_device are consolidated into io.weight which
  uses the standardized weight range [1, 10000] with 100 as the default value.

* blkio.throttle.{read|write}_{bps|iops}_device are consolidated into io.max.
  Expansion of throttling features is being worked on to support
  work-conserving absolute limits (io.low and io.high).

* All stats are consolidated into io.stats.

This patchset adds support for the new interface.  As the interface has been
revamped and new features are expected to be added, it seems best to treat it
as a separate controller rather than trying to expand the blkio settings
although we might add automatic translation if only blkio settings are
specified.

* io.weight handling is mostly identical to blkio.weight[_device] handling
  except that the weight range is different.

* Both read and write bandwidth settings are consolidated into
  CGroupIODeviceLimit which describes all limits applicable to the device.
  This makes it less painful to add new limits.

* "max" can be used to specify the maximum limit which is equivalent to no
  config for max limits and treated as such.  If a given CGroupIODeviceLimit
  doesn't contain any non-default configs, the config struct is discarded once
  the no limit config is applied to cgroup.

* lookup_blkio_device() is renamed to lookup_block_device().

Signed-off-by: Tejun Heo <htejun@fb.com>
2016-05-05 16:43:06 -04:00
Lennart Poettering 1ed7ebcfca Merge pull request #3170 from poettering/v230-preparation-fixes
make virtualization detection quieter, rework unit start limit logic, detect unit file drop-in changes correctly, fix autofs state propagation
2016-05-04 10:46:13 +02:00
Zbigniew Jędrzejewski-Szmek a82394c889 Merge pull request #2921 from keszybz/do-not-report-masked-units-as-changed 2016-05-03 14:08:39 -04:00
Zbigniew Jędrzejewski-Szmek d43bbb52de Revert "Do not report masked units as changed (#2921)"
This reverts commit 6d10d308c6.

It got squashed by mistake.
2016-05-03 14:08:23 -04:00
Lennart Poettering 5c6c275e43 Merge pull request #3162 from keszybz/alias-refusal
Refuse Alias, DefaultInstance, templated units in install (as appropriate)
2016-05-02 20:40:54 +02:00
Lennart Poettering ab932a622d core: simplify unit_need_daemon_reload() a bit
And let's make it more accurate: if we have acquire the list of unit drop-ins,
then let's do a full comparison against the old list we already have, and if
things differ in any way, we know we have to reload.

This makes sure we detect changes to drop-in directories in more cases.
2016-05-02 15:10:35 +02:00
Lennart Poettering 87ec20ef20 core: fix detection whether per-unit drop-ins changed
This fixes fall-out from 6d10d308c6.

Until that commit, do determine whether a daemon reload was required we compare
the mtime of the main unit file we loaded with the mtime of it on disk for
equality, but for drop-ins we only stored the newest mtime of all of them and
then did a "newer-than" comparison. This was brokeni with the above commit,
when all checks where changed to be for equality.

With this change all checks are now done as "newer-than", fixing the drop-in
mtime case. Strictly speaking this will not detect a number of changes that the
code before above commit detected, but given that the mtime is unlikely to go
backwards, and this is just intended to be a helpful hint anyway, this looks OK
in order to keep things simple.

Fixes: #3123
2016-05-02 15:10:24 +02:00
Lennart Poettering 072993504e core: move enforcement of the start limit into per-unit-type code again
Let's move the enforcement of the per-unit start limit from unit.c into the
type-specific files again. For unit types that know a concept of "result" codes
this allows us to hook up the start limit condition to it with an explicit
result code. Also, this makes sure that the state checks in clal like
service_start() may be done before the start limit is checked, as the start
limit really should be checked last, right before everything has been verified
to be in order.

The generic start limit logic is left in unit.c, but the invocation of it is
moved into the per-type files, in the various xyz_start() functions, so that
they may place the check at the right location.

Note that this change drops the enforcement entirely from device, slice, target
and scope units, since these unit types generally may not fail activation, or
may only be activated a single time. This is also documented now.

Note that restores the "start-limit-hit" result code that existed before
6bf0f408e4 already in the service code. However,
it's not introduced for all units that have a result code concept.

Fixes #3166.
2016-05-02 13:08:00 +02:00
Zbigniew Jędrzejewski-Szmek ce99c68a33 Move no_instances information to shared/
This way it can be used in install.c in subsequent commit.
2016-05-01 19:58:59 -04:00
Zbigniew Jędrzejewski-Szmek 8a993b61d1 Move no_alias information to shared/
This way it can be used in install.c in subsequent commit.
2016-05-01 19:40:51 -04:00
Zbigniew Jędrzejewski-Szmek bc1d8669b8 Merge pull request #3152 from poettering/aliasfix
Refuse aliases to non-aliasable units in more places

Fixes #2730.
2016-04-30 18:00:46 -04:00
Lennart Poettering 934e749e18 core: refuse merging on units when the unit type does not support alias
The concept of merging units exists so that we can create Unit objects for a
number of names early, and then load them only later, possibly merging units
which then turn out to be symlinked to other names. This of course only makes
sense for unit types where multiple names per unit are supported. For all
others, let's refuse the merge operation early.
2016-04-29 17:31:02 +02:00
Lennart Poettering b75102e5bf core: rerun GC logic for a unit that loses a reference
Let's make sure when we drop a reference to a unit, that we run the GC queue on
it again.

This (together with the previous commit) should deal with the GC issues pointed
out in:

https://github.com/systemd/systemd/pull/2993#issuecomment-215331189
2016-04-29 16:27:49 +02:00
Lennart Poettering 7629ec4642 core: move start ratelimiting check after condition checks
With #2564 unit start rate limiting was moved from after the condition checks
are to before they are made, in an attempt to fix #2467. This however resulted
in #2684. However, with a previous commit a concept of per socket unit trigger
rate limiting has been added, to fix #2467 more comprehensively, hence the
start limit can be moved after the condition checks again, thus fixing #2684.

Fixes: #2684
2016-04-29 16:27:48 +02:00
Lennart Poettering 291d565a04 core,systemctl: add bus API to retrieve processes of a unit
This adds a new GetProcesses() bus call to the Unit object which returns an
array consisting of all PIDs, their process names, as well as their full cgroup
paths. This is then used by "systemctl status" to show the per-unit process
tree.

This has the benefit that the client-side no longer needs to access the
cgroupfs directly to show the process tree of a unit. Instead, it now uses this
new API, which means it also works if -H or -M are used correctly, as the
information from the specific host is used, and not the one from the local
system.

Fixes: #2945
2016-04-22 16:06:20 +02:00
Zbigniew Jędrzejewski-Szmek ccddd104fc tree-wide: use mdash instead of a two minuses 2016-04-21 23:00:13 -04:00
Zbigniew Jędrzejewski-Szmek 81d621034b tree-wide: remove useless NULLs from strjoina
The coccinelle patch didn't work in some places, I have no idea why.
2016-04-13 08:56:44 -04:00
Zbigniew Jędrzejewski-Szmek 78e334b50f basic/util: silence stupid gcc warnings about unitialized variable 2016-04-13 08:56:44 -04:00
Lennart Poettering f9ba08fb4f core: keep track of the mtime of the transient unit file we wrote
Otherwise "systemctl status" will immediately report that our unit file is out
of date.
2016-04-12 13:43:33 +02:00
Lennart Poettering 815b09d39b core: optimize unit_write_drop_in a bit
There's no point in first determining the drop-in file name path, then
forgetting it again, and then determining it again. Instead, just generated it
once, and then write to ti directly.
2016-04-12 13:43:33 +02:00
Lennart Poettering e20b2a867a core: when creating a drop-in snippet, add a comment explaining this to it 2016-04-12 13:43:32 +02:00
Lennart Poettering 6eb7c172b5 tree-wide: add new SIGNAL_VALID() macro-like function that validates signal numbers
And port all code over to use it.
2016-04-12 13:43:32 +02:00
Lennart Poettering 4f4afc88ec core: rework how transient unit files and property drop-ins work
With this change the logic for placing transient unit files and drop-ins
generated via "systemctl set-property" is reworked.

The latter are now placed in the newly introduced "control" unit file
directory. The fomer are now placed in the "transient" unit file directory.

Note that the properties originally set when a transient unit was created will
be written to and stay in the transient unit file directory, while later
changes are done via drop-ins.

This is preparation for a later "systemctl revert" addition, where existing
drop-ins are flushed out, but the original transient definition is restored.
2016-04-12 13:43:32 +02:00
Lennart Poettering 193dc81ee3 core: don't reorder drop-ins when changing properties
The drop-in order we present should actually show what we is in effect, hence
let's not reorder it when writing changes. After all, just sorting
alphabetically is going to break things, as it doesn't respect that /etc breaks
/run breaks /usr...
2016-04-12 13:43:31 +02:00
Lennart Poettering 3959135139 core: add a separate unit directory for transient units
Previously, transient units were created below the normal runtime directory
/run/systemd/system. With this change they are created in a special transient
directory /run/systemd/transient, which only contains data for transient units.

This clarifies the life-cycle of transient units, and makes clear they are
distinct from user-provided runtime units. In particular, users may now
extend transient units via /run/systemd/system, without systemd interfering
with the life-cycle of these files.

This change also adds code so that when a transient unit exits only the
drop-ins in this new directory are removed, but nothing else.

Fixes: #2139
2016-04-12 13:43:30 +02:00
Lennart Poettering 2c289ea833 core: introduce MANAGER_IS_RELOADING() macro
This replaces the old function call manager_is_reloading_or_reexecuting() which
was used only at very few places. Use the new macro wherever we check whether
we are reloading. This should hopefully make things a bit more readable, given
the nature of Manager:n_reloading being a counter.
2016-04-12 13:43:30 +02:00