Commit graph

155 commits

Author SHA1 Message Date
Lennart Poettering f1c50becda core: make sure to log invocation ID of units also when doing structured logging 2017-09-22 15:24:55 +02:00
Lennart Poettering 6b659ed87e core: serialize/deserialize IP accounting across daemon reload/reexec
Make sure the current IP accounting counters aren't lost during
reload/reexec.

Note that we destroy all BPF file objects during a reload: the BPF
programs, the access and the accounting maps. The former two need to be
regenerated anyway with the newly loaded configuration data, but the
latter one needs to survive reloads/reexec. In this implementation I
opted to only save/restore the accounting map content instead of the map
itself. While this opens a (theoretic) window where IP traffic is still
accounted to the old map after we read it out, and we thus miss a few
bytes this has the benefit that we can alter the map layout between
versions should the need arise.
2017-09-22 15:24:55 +02:00
Lennart Poettering a79279c7fd core: when creating the socket fds for a socket unit, join socket's cgroup first
Let's make sure that a socket unit's IPAddressAllow=/IPAddressDeny=
settings are in effect on all socket fds associated with it. In order to
make this happen we need to make sure the cgroup the fds are associated
with are the socket unit's cgroup. The only way to do that is invoking
socket()+accept() in them. Since we really don't want to migrate PID 1
around we do this by forking off a helper process, which invokes
socket()/accept() and sends the newly created fd to PID 1. Ugly, but
works, and there's apparently no better way right now.

This generalizes forking off per-unit helper processes in a new function
unit_fork_helper_process(), which is then also used by the NSS chown()
code of socket units.
2017-09-22 15:24:55 +02:00
Daniel Mack 906c06f64a cgroup, unit, fragment parser: make use of new firewall functions 2017-09-22 15:24:55 +02:00
Daniel Mack 6a48d82f02 cgroup: add fields to accommodate eBPF related details
Add pointers for compiled eBPF programs as well as list heads for allowed
and denied hosts for both directions.
2017-09-22 15:24:54 +02:00
Lennart Poettering f0d477979e core: introduce unit_set_exec_params()
The new unit_set_exec_params() call is to units what
manager_set_exec_params() is to the manager object: it initializes the
various fields from the relevant generic properties set.
2017-08-10 15:02:50 +02:00
Zbigniew Jędrzejewski-Szmek 0742986650 core: properly handle deserialization of unknown unit types (#6476)
We just abort startup, without printing any error. Make sure we always
print something, and when we cannot deserialize some unit, just ignore it and
continue.

Fixup for 4bc5d27b94. Without this, we would hang
in daemon-reexec after upgrade.
2017-07-31 08:05:35 +02:00
Zbigniew Jędrzejewski-Szmek 4bc5d27b94 Drop busname unit type
Since busname units are only useful with kdbus, they weren't actively
used. This was dead code, only compile-tested. If busname units are
ever added back, it'll be cleaner to start from scratch (possibly reverting
parts of this patch).
2017-07-23 09:29:02 -04:00
Michal Koutný a2df3ea4ae job: add JobRunningTimeoutSec for JOB_RUNNING state
Unit.JobTimeoutSec starts when a job is enqueued in a transaction. The
introduced distinct Unit.JobRunningTimeoutSec starts only when the job starts
running (e.g. it groups all Exec* commands of a service or spans waiting for a
device period.)

Unit.JobRunningTimeoutSec is intended to be used by default instead of
Unit.JobTimeoutSec for device units where such behavior causes less confusion
(consider a job for a _netdev mount device, with this change the timeout will
start ticking only after the network is ready).
2017-04-25 18:00:29 +02:00
Lennart Poettering 2e6dbc0fcd Merge pull request #4538 from fbuihuu/confirm-spawn-fixes
Confirm spawn fixes/enhancements
2016-11-18 11:08:06 +01:00
Franck Bui c891efaf8a core: confirm_spawn: always accept units with same_pgrp set for now
For some reasons units remaining in the same process group as PID 1
(same_pgrp=true) fail to acquire the console even if it's not taken by anyone.

So always accept for units with same_pgrp set for now.
2016-11-17 18:16:51 +01:00
Lennart Poettering c5a97ed132 core: GC redundant device jobs from the run queue
In contrast to all other unit types device units when queued just track
external state, they cannot effect state changes on their own. Hence unless a
client or other job waits for them there's no reason to keep them in the job
queue. This adds a concept of GC'ing jobs of this type as soon as no client or
other job waits for them anymore.

To ensure this works correctly we need to track which clients actually
reference a job (i.e. which ones enqueued it). Unfortunately that's pretty
nasty to do for direct connections, as sd_bus_track doesn't work for
them. For now, work around this, by simply remembering in a boolean that a job
was requested by a direct connection, and reset it when we notice the direct
connection is gone. This means the GC logic works fine, except that jobs are
not immediately removed when direct connections disconnect.

In the longer term, a rework of the bus logic should fix this properly. For now
this should be good enough, as GC works for fine all cases except this one, and
thus is a clear improvement over the previous behaviour.

Fixes: #1921
2016-11-16 15:03:26 +01:00
Zbigniew Jędrzejewski-Szmek 7fa6328cc4 Merge pull request #4481 from poettering/perpetual
Add "perpetual" unit concept, sysctl fixes, networkd fixes, systemctl color fixes, nspawn discard.
2016-11-02 21:03:26 -04:00
Lennart Poettering a581e45ae8 unit: unify some code with new unit_new_for_name() call 2016-11-02 11:29:59 -06:00
Lennart Poettering f5869324e3 core: rework the "no_gc" unit flag to become a more generic "perpetual" flag
So far "no_gc" was set on -.slice and init.scope, to units that are always
running, cannot be stopped and never exist in an "inactive" state. Since these
units are the only users of this flag, let's remodel it and rename it
"perpetual" and let's derive more funcitonality off it. Specifically, refuse
enqueing stop jobs for these units, and report that they are "unstoppable" in
the CanStop bus property.
2016-11-02 11:29:59 -06:00
Zbigniew Jędrzejewski-Szmek f0bfbfac43 core: when restarting services, don't close fds
We would close all the stored fds in service_release_resources(), which of
course broke the whole concept of storing fds over service restart.

Fixes #4408.
2016-11-01 21:20:21 -04:00
Lukas Nykryn 87a47f99bc failure-action: generalize failure action to emergency action 2016-10-21 15:13:50 +02:00
Lennart Poettering 4b58153dd2 core: add "invocation ID" concept to service manager
This adds a new invocation ID concept to the service manager. The invocation ID
identifies each runtime cycle of a unit uniquely. A new randomized 128bit ID is
generated each time a unit moves from and inactive to an activating or active
state.

The primary usecase for this concept is to connect the runtime data PID 1
maintains about a service with the offline data the journal stores about it.
Previously we'd use the unit name plus start/stop times, which however is
highly racy since the journal will generally process log data after the service
already ended.

The "invocation ID" kinda matches the "boot ID" concept of the Linux kernel,
except that it applies to an individual unit instead of the whole system.

The invocation ID is passed to the activated processes as environment variable.
It is additionally stored as extended attribute on the cgroup of the unit. The
latter is used by journald to automatically retrieve it for each log logged
message and attach it to the log entry. The environment variable is very easily
accessible, even for unprivileged services. OTOH the extended attribute is only
accessible to privileged processes (this is because cgroupfs only supports the
"trusted." xattr namespace, not "user."). The environment variable may be
altered by services, the extended attribute may not be, hence is the better
choice for the journal.

Note that reading the invocation ID off the extended attribute from journald is
racy, similar to the way reading the unit name for a logging process is.

This patch adds APIs to read the invocation ID to sd-id128:
sd_id128_get_invocation() may be used in a similar fashion to
sd_id128_get_boot().

PID1's own logging is updated to always include the invocation ID when it logs
information about a unit.

A new bus call GetUnitByInvocationID() is added that allows retrieving a bus
path to a unit by its invocation ID. The bus path is built using the invocation
ID, thus providing a path for referring to a unit that is valid only for the
current runtime cycleof it.

Outlook for the future: should the kernel eventually allow passing of cgroup
information along AF_UNIX/SOCK_DGRAM messages via a unique cgroup id, then we
can alter the invocation ID to be generated as hash from that rather than
entirely randomly. This way we can derive the invocation race-freely from the
messages.
2016-10-07 20:14:38 +02:00
Evgeny Vereshchagin 6afe14ff5b Merge pull request #3984 from poettering/refcnt
permit bus clients to pin units to avoid automatic GC
2016-08-26 16:17:05 +03:00
Felipe Sateler 8dec4a9d2d core,network: Use const qualifiers for block-local variables in macro functions (#4019)
Prevents discard-qualifiers warnings when the passed variable was const
2016-08-23 12:29:30 +03:00
Lennart Poettering fe700f46ec core: cache last CPU usage counter, before destorying a cgroup
It is useful for clients to be able to read the last CPU usage counter value of
a unit even if the unit is already terminated. Hence, before destroying a
cgroup's cgroup cache the last CPU usage counter and return it if the cgroup is
gone.
2016-08-22 16:14:21 +02:00
Lennart Poettering 05a98afd3e core: add Ref()/Unref() bus calls for units
This adds two (privileged) bus calls Ref() and Unref() to the Unit interface.
The two calls may be used by clients to pin a unit into memory, so that various
runtime properties aren't flushed out by the automatic GC. This is necessary
to permit clients to race-freely acquire runtime results (such as process exit
status/code or accumulated CPU time) on successful service termination.

Ref() and Unref() are fully recursive, hence act like the usual reference
counting concept in C. Taking a reference is a privileged operation, as this
allows pinning units into memory which consumes resources.

Transient units may also gain a reference at the time of creation, via the new
AddRef property (that is only defined for transient units at the time of
creation).
2016-08-22 16:14:21 +02:00
Lennart Poettering 00d9ef8560 core: add RemoveIPC= setting
This adds the boolean RemoveIPC= setting to service, socket, mount and swap
units (i.e.  all unit types that may invoke processes). if turned on, and the
unit's user/group is not root, all IPC objects of the user/group are removed
when the service is shut down. The life-cycle of the IPC objects is hence bound
to the unit life-cycle.

This is particularly relevant for units with dynamic users, as it is essential
that no objects owned by the dynamic users survive the service exiting. In
fact, this patch adds code to imply RemoveIPC= if DynamicUser= is set.

In order to communicate the UID/GID of an executed process back to PID 1 this
adds a new "user lookup" socket pair, that is inherited into the forked
processes, and closed before the exec(). This is needed since we cannot do NSS
from PID 1 due to deadlock risks, However need to know the used UID/GID in
order to clean up IPC owned by it if the unit shuts down.
2016-08-19 00:37:25 +02:00
Lennart Poettering b4c990e91b unit: remove orphaned cgroup_netclass_id field 2016-08-18 22:49:48 +02:00
Tejun Heo 66ebf6c0a1 core: add cgroup CPU controller support on the unified hierarchy
Unfortunately, due to the disagreements in the kernel development community,
CPU controller cgroup v2 support has not been merged and enabling it requires
applying two small out-of-tree kernel patches.  The situation is explained in
the following documentation.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup-v2-cpu.txt?h=cgroup-v2-cpu

While it isn't clear what will happen with CPU controller cgroup v2 support,
there are critical features which are possible only on cgroup v2 such as
buffered write control making cgroup v2 essential for a lot of workloads.  This
commit implements systemd CPU controller support on the unified hierarchy so
that users who choose to deploy CPU controller cgroup v2 support can easily
take advantage of it.

On the unified hierarchy, "cpu.weight" knob replaces "cpu.shares" and "cpu.max"
replaces "cpu.cfs_period_us" and "cpu.cfs_quota_us".  [Startup]CPUWeight config
options are added with the usual compat translation.  CPU quota settings remain
unchanged and apply to both legacy and unified hierarchies.

v2: - Error in man page corrected.
    - CPU config application in cgroup_context_apply() refactored.
    - CPU accounting now works on unified hierarchy.
2016-08-07 09:45:39 -04:00
Lennart Poettering 29206d4619 core: add a concept of "dynamic" user ids, that are allocated as long as a service is running
This adds a new boolean setting DynamicUser= to service files. If set, a new
user will be allocated dynamically when the unit is started, and released when
it is stopped. The user ID is allocated from the range 61184..65519. The user
will not be added to /etc/passwd (but an NSS module to be added later should
make it show up in getent passwd).

For now, care should be taken that the service writes no files to disk, since
this might result in files owned by UIDs that might get assigned dynamically to
a different service later on. Later patches will tighten sandboxing in order to
ensure that this cannot happen, except for a few selected directories.

A simple way to test this is:

        systemd-run -p DynamicUser=1 /bin/sleep 99999
2016-07-22 15:53:45 +02:00
Lennart Poettering 1d98fef17d core: when forcibly killing/aborting left-over unit processes log about it
Let's lot at LOG_NOTICE about any processes that we are going to
SIGKILL/SIGABRT because clean termination of them didn't work.

This turns the various boolean flag parameters to cg_kill(), cg_migrate() and
related calls into a single binary flags parameter, simply because the function
now gained even more parameters and the parameter listed shouldn't get too
long.

Logging for killing processes is done either when the kill signal is SIGABRT or
SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE
is passed. This isn't used yet in this patch, but is made use of in a later
patch.
2016-07-20 14:35:15 +02:00
Kyle Walker 36f20ae3b2 manager: Only invoke a single sigchld per unit within a cleanup cycle
By default, each iteration of manager_dispatch_sigchld() results in a unit level
sigchld event being invoked. For scope units, this results in a scope_sigchld_event()
which can seemingly stall for workloads that have a large number of PIDs within the
scope. The stall exhibits itself as a SIG_0 being initiated for each u->pids entry
as a result of pid_is_unwaited().

v2:
This patch resolves this condition by only paying to cost of a sigchld in the underlying
scope unit once per sigchld iteration. A new "sigchldgen" member resides within the
Unit struct. The Manager is incremented via the sd event loop, accessed via
sd_event_get_iteration, and the Unit member is set to the same value as the manager each
time that a sigchld event is invoked. If the Manager iteration value and Unit member
match, the sigchld event is not invoked for that iteration.
2016-06-30 15:16:47 -04:00
Zbigniew Jędrzejewski-Szmek 74ad38ff0e Merge pull request #3160 from htejun/cgroup-fixes-rev2
Cgroup fixes.
2016-05-07 15:08:57 -04:00
Lennart Poettering 1ed7ebcfca Merge pull request #3170 from poettering/v230-preparation-fixes
make virtualization detection quieter, rework unit start limit logic, detect unit file drop-in changes correctly, fix autofs state propagation
2016-05-04 10:46:13 +02:00
Lennart Poettering 072993504e core: move enforcement of the start limit into per-unit-type code again
Let's move the enforcement of the per-unit start limit from unit.c into the
type-specific files again. For unit types that know a concept of "result" codes
this allows us to hook up the start limit condition to it with an explicit
result code. Also, this makes sure that the state checks in clal like
service_start() may be done before the start limit is checked, as the start
limit really should be checked last, right before everything has been verified
to be in order.

The generic start limit logic is left in unit.c, but the invocation of it is
moved into the per-type files, in the various xyz_start() functions, so that
they may place the check at the right location.

Note that this change drops the enforcement entirely from device, slice, target
and scope units, since these unit types generally may not fail activation, or
may only be activated a single time. This is also documented now.

Note that restores the "start-limit-hit" result code that existed before
6bf0f408e4 already in the service code. However,
it's not introduced for all units that have a result code concept.

Fixes #3166.
2016-05-02 13:08:00 +02:00
Zbigniew Jędrzejewski-Szmek ce99c68a33 Move no_instances information to shared/
This way it can be used in install.c in subsequent commit.
2016-05-01 19:58:59 -04:00
Zbigniew Jędrzejewski-Szmek 8a993b61d1 Move no_alias information to shared/
This way it can be used in install.c in subsequent commit.
2016-05-01 19:40:51 -04:00
Tejun Heo ccf78df1fc core: make unit_has_mask_realized() consider controller enable state
unit_has_mask_realized() determines whether the specified unit has its cgroups
set up properly given the desired target_mask; however, on the unified
hierarchy, controllers need to be enabled explicitly for children and the mask
of enabled controllers can deviate from target_mask.  Only considering
target_mask in unit_has_mask_realized() can lead to false positives and
skipping enabling the requested controllers.

This patch adds unit->cgroup_enabled_mask to track which controllers are
enabled and updates unit_has_mask_realized() to also consider enable_mask.

Signed-off-by: Tejun Heo <htejun@fb.com>
2016-04-30 16:12:54 -04:00
Lennart Poettering 291d565a04 core,systemctl: add bus API to retrieve processes of a unit
This adds a new GetProcesses() bus call to the Unit object which returns an
array consisting of all PIDs, their process names, as well as their full cgroup
paths. This is then used by "systemctl status" to show the per-unit process
tree.

This has the benefit that the client-side no longer needs to access the
cgroupfs directly to show the process tree of a unit. Instead, it now uses this
new API, which means it also works if -H or -M are used correctly, as the
information from the specific host is used, and not the one from the local
system.

Fixes: #2945
2016-04-22 16:06:20 +02:00
Lennart Poettering 4f4afc88ec core: rework how transient unit files and property drop-ins work
With this change the logic for placing transient unit files and drop-ins
generated via "systemctl set-property" is reworked.

The latter are now placed in the newly introduced "control" unit file
directory. The fomer are now placed in the "transient" unit file directory.

Note that the properties originally set when a transient unit was created will
be written to and stay in the transient unit file directory, while later
changes are done via drop-ins.

This is preparation for a later "systemctl revert" addition, where existing
drop-ins are flushed out, but the original transient definition is restored.
2016-04-12 13:43:32 +02:00
Martin Pitt 16a798deb3 Merge pull request #2569 from zonque/removals
Remove some old cruft
2016-02-10 14:01:46 +01:00
Daniel Mack b26fa1a2fb tree-wide: remove Emacs lines from all files
This should be handled fine now by .dir-locals.el, so need to carry that
stuff in every file.
2016-02-10 13:41:57 +01:00
Lennart Poettering 6bf0f408e4 core: make the StartLimitXYZ= settings generic and apply to any kind of unit, not just services
This moves the StartLimitBurst=, StartLimitInterval=, StartLimitAction=, RebootArgument= from the [Service] section
into the [Unit] section of unit files, and thus support it in all unit types, not just in services.

This way we can enforce the start limit much earlier, in particular before testing the unit conditions, so that
repeated start-up failure due to failed conditions is also considered for the start limit logic.

For compatibility the four options may also be configured in the [Service] section still, but we only document them in
their new section [Unit].

This also renamed the socket unit failure code "service-failed-permanent" into "service-start-limit-hit" to express
more clearly what it is about, after all it's only triggered through the start limit being hit.

Finally, the code in busname_trigger_notify() and socket_trigger_notify() is altered to become more alike.

Fixes: #2467
2016-02-10 13:26:56 +01:00
Lennart Poettering 7a7821c878 core: rework job_get_timeout() to use usec_t and handle USEC_INFINITY time events correctly 2016-02-04 00:35:43 +01:00
Lennart Poettering a483fb59a8 core: store for each unit when the last low-level unit state change took place
This adds a new timestamp field to the Unit struct, storing when the last low-level state change took place, and make
sure this is restored after a daemon reload. This new field is useful to allow restarting of per-state timers exactly
where they originally started.
2016-02-01 22:18:16 +01:00
Harald Hoyer 9d06297e26 core: Do not bind a mount unit to a device, if it was from mountinfo
If a mount unit is bound to a device, systemd tries to umount the
mount point, if it thinks the device has gone away.

Due to the uevent queue and inotify of /proc/self/mountinfo being two
different sources, systemd can never get the ordering reliably correct.

It can happen, that in the uevent queue ADD,REMOVE,ADD is queued
and an inotify of mountinfo (or libmount event) happend with the
device in question.

systemd cannot know, at which point of time the mount happend in the
ADD,REMOVE,ADD sequence.

The real ordering might have been ADD,REMOVE,ADD,mount
and systemd might think ADD,mount,REMOVE,ADD and would umount the
mountpoint.

A test script which triggered this behaviour is:
rm -f test-efi-disk.img
dd if=/dev/null of=test-efi-disk.img bs=1M seek=512 count=1
parted --script test-efi-disk.img \
  "mklabel gpt" \
  "mkpart ESP fat32 1MiB 511MiB" \
  "set 1 boot on"
LOOP=$(losetup --show -f -P test-efi-disk.img)
udevadm settle
mkfs.vfat -F32 ${LOOP}p1
mkdir -p mnt
mount ${LOOP}p1 mnt
... <dostuffwith mnt>

Without the "udevadm settle" systemd unmounted mnt while the script was
operating on mnt.

Of course the question is, why there was a REMOVE in the first place,
but this is not part of this patch.
2015-11-24 14:08:50 +01:00
Thomas Hindoe Paaboel Andersen 71d35b6b55 tree-wide: sort includes in *.h
This is a continuation of the previous include sort patch, which
only sorted for .c files.
2015-11-18 23:09:02 +01:00
Lennart Poettering 0f13f3bd79 core: move check whether a unit is suitable to become transient into unit.c
Lets introduce unit_is_pristine() that verifies whether a unit is
suitable to become a transient unit, by checking that it is no
referenced yet and has no data on disk assigned.
2015-11-17 17:32:49 +01:00
Lennart Poettering 702d4e6f14 core: now that .snapshot unit are gone, we don't need the per-type .no_gc bool anymore 2015-11-13 19:50:52 +01:00
Tom Gundersen 7042fc14ff Merge pull request #1837 from poettering/grabbag2
variety of fixes
2015-11-11 02:31:29 +01:00
Zbigniew Jędrzejewski-Szmek 36b4a7ba55 Remove snapshot unit type
Snapshots were never useful or used for anything. Many systemd
developers that I spoke to at systemd.conf2015, didn't even know they
existed, so it is fairly safe to assume that this type can be deleted
without harm.

The fundamental problem with snapshots is that the state of the system
is dynamic, devices come and go, users log in and out, timers fire...
and restoring all units to some state from the past would "undo"
those changes, which isn't really possible.

Tested by creating a snapshot, running the new binary, and checking
that the transition did not cause errors, and the snapshot is gone,
and snapshots cannot be created anymore.

New systemctl says:
Unknown operation snapshot.
Old systemctl says:
Failed to create snapshot: Support for snapshots has been removed.

IgnoreOnSnaphost settings are warned about and ignored:
Support for option IgnoreOnSnapshot= has been removed and it is ignored

http://lists.freedesktop.org/archives/systemd-devel/2015-November/034872.html
2015-11-10 19:33:06 -05:00
Lennart Poettering 9ff1a6f1d6 core: change type of distribute_fds() prototype to return void
We can't handle errors of thisc all sanely anyway, and we never actually
return any errors from the unit type that implements the call.  Hence,
let's make this void, in order to simplify things.
2015-11-10 21:03:49 +01:00
Lennart Poettering ba64af90ec core: change return value of the unit's enumerate() call to void
We cannot handle enumeration failures in a sensible way, hence let's try
hard to continue without making such failures fatal, and log about it
with precise error messages.
2015-11-10 21:03:49 +01:00
Thomas Hindoe Paaboel Andersen b250ea2fd6 tree-wide: remove unused functions 2015-10-19 21:46:01 +02:00