Commit Graph

201 Commits

Author SHA1 Message Date
Lennart Poettering a3c1168ac2 core: rework StopWhenUnneeded= logic
Previously, we'd act immediately on StopWhenUnneeded= when a unit state
changes. With this rework we'll maintain a queue instead: whenever
there's the chance that StopWhenUneeded= might have an effect we enqueue
the unit, and process it later when we have nothing better to do.

This should make the implementation a bit more reliable, as the unit notify event
cannot immediately enqueue tons of side-effect jobs that might
contradict each other, but we do so only in a strictly ordered fashion,
from the main event loop.

This slightly changes the check when to consider a unit "unneeded".
Previously, we'd assume that a unit in "deactivating" state could also
be cleaned up. With this new logic we'll only consider units unneeded
that are fully up and have no job queued. This means that whenever
there's something pending for a unit we won't clean it up.
2018-08-10 16:19:01 +02:00
Yu Watanabe 845d247a3d tree-wide: use 'signed int' instead of 'int' for bit field variables
Suggested by LGTM: https://lgtm.com/rules/1506024027114/
2018-06-28 10:16:51 +02:00
Lennart Poettering 0c69794138 tree-wide: remove Lennart's copyright lines
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
2018-06-14 10:20:20 +02:00
Lennart Poettering 818bf54632 tree-wide: drop 'This file is part of systemd' blurb
This part of the copyright blurb stems from the GPL use recommendations:

https://www.gnu.org/licenses/gpl-howto.en.html

The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.

hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
2018-06-14 10:20:20 +02:00
Lennart Poettering ef31828d06 tree-wide: unify how we define bit mak enums
Let's always write "1 << 0", "1 << 1" and so on, except where we need
more than 31 flag bits, where we write "UINT64(1) << 0", and so on to force
64bit values.
2018-06-12 21:44:00 +02:00
Lennart Poettering 04eb582acc core: enumerate perpetual units in a separate per-unit-type method
Previously the enumerate() callback defined for each unit type would do
two things:

1. It would create perpetual units (i.e. -.slice, system.slice, -.mount and
   init.scope)

2. It would enumerate units from /proc/self/mountinfo, /proc/swaps and
   the udev database

With this change these two parts are split into two seperate methods:
enumerate() now only does #2, while enumerate_perpetual() is responsible
for #1. Why make this change? Well, perpetual units should have a
slightly different effect that those found through enumeration: as
perpetual units should be up unconditionally, perpetually and thus never
change state, they should also not pull in deps by their state changing,
not even when the state is first set to active. Thus, their state is
generally initialized through the per-device coldplug() method in
similar  fashion to the deserialized state from a previous run would be
put into place. OTOH units found through regular enumeration should
result in state changes (and thus pull in deps due to state changes),
hence their state should be put in effect in the catchup() method
instead. Hence, given this difference, let's also separate the
functions, so that the rule is:

1. What is created in enumerate_perpetual() should be started in
   coldplug()

2. What is created in enumerate() should be started in catchup().
2018-06-07 15:29:17 +02:00
Lennart Poettering f0831ed2a0 core: add a new unit method "catchup()"
This is very similar to the existing unit method coldplug() but is
called a bit later. The idea is that that coldplug() restores the unit
state from before any prior reload/restart, i.e. puts the deserialized
state in effect. The catchup() call is then called a bit later, to
catch up with the system state for which we missed notifications while
we were reloading. This is only really useful for mount, swap and device
mount points were we should be careful to generate all missing unit
state change events (i.e. call unit_notify() appropriately) for
everything that happened while we were reloading.
2018-06-07 15:28:50 +02:00
Lennart Poettering bbf5fd8e41 core: subscribe to /etc/localtime timezone changes and update timer elapsation accordingly
Fixes: #8233

This is our first real-life usecase for the new sd_event_add_inotify()
calls we just added.
2018-06-06 10:53:56 +02:00
Lennart Poettering 50be4f4a46 core: rework how we track service and scope PIDs
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.

This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.

Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.

All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.

Fixes: #9138
2018-06-05 22:06:48 +02:00
Lennart Poettering 2ad2e41a72 core: don't trigger OnFailure= deps when a unit is going to restart
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.

Note that this adjusts behaviour of systemd to match what the docs say.

Fixes: #8398
2018-06-01 19:08:30 +02:00
Felipe Sateler 57b7a260c2 core: undo the dependency inversion between unit.h and all unit types 2018-05-15 14:24:34 -04:00
Lennart Poettering d4fd1cf208 core: enforce that scope units can be started only once
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
2018-04-27 21:52:45 +02:00
Zbigniew Jędrzejewski-Szmek 11a1589223 tree-wide: drop license boilerplate
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.

I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
2018-04-06 18:58:55 +02:00
Michal Sekletar 19496554e2 core: delay adding target dependencies until all units are loaded and aliases resolved (#8381)
Currently we add target dependencies while we are loading units. This
can create ordering loops even if configuration doesn't contain any
loop. Take for example following configuration,

$ systemctl get-default
multi-user.target

$ cat /etc/systemd/system/test.service
[Unit]
After=default.target

[Service]
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

If we encounter such unit file early during manager start-up (e.g. load
queue is dispatched while enumerating devices due to SYSTEMD_WANTS in
udev rules) we would add stub unit default.target and we order it Before
test.service. At the same time we add implicit Before to
multi-user.target. Later we merge two units and we create ordering cycle
in the process.

To fix the issue we will now never add any target dependencies until we
loaded all the unit files and resolved all the aliases.
2018-03-23 15:28:06 +01:00
Zbigniew Jędrzejewski-Szmek dc409696cf Introduce _cleanup_(unit_freep) 2018-03-11 16:33:58 +01:00
Lennart Poettering aa2b6f1d2b bpf: rework how we keep track and attach cgroup bpf programs
So, the kernel's management of cgroup/BPF programs is a bit misdesigned:
if you attach a BPF program to a cgroup and close the fd for it it will
stay pinned to the cgroup with no chance of ever removing it again (or
otherwise getting ahold of it again), because the fd is used for
selecting which BPF program to detach. The only way to get rid of the
program again is to destroy the cgroup itself.

This is particularly bad for root the cgroup (and in fact any other
cgroup that we cannot realistically remove during runtime, such as
/system.slice, /init.scope or /system.slice/dbus.service) as getting rid
of the program only works by rebooting the system.

To counter this let's closely keep track to which cgroup a BPF program
is attached and let's implicitly detach the BPF program when we are
about to close the BPF fd.

This hence changes the bpf_program_cgroup_attach() function to track
where we attached the program and changes bpf_program_cgroup_detach() to
use this information. Moreover bpf_program_unref() will now implicitly
call bpf_program_cgroup_detach().

In order to simplify things, bpf_program_cgroup_attach() will now
implicitly invoke bpf_program_load_kernel() when necessary, simplifying
the caller's side.

Finally, this adds proper reference counting to BPF programs. This
is useful for working with two BPF programs in parallel: the BPF program
we are preparing for installation and the BPF program we so far
installed, shortening the window when we detach the old one and reattach
the new one.
2018-02-21 16:43:36 +01:00
Lennart Poettering a94ab7acfd
Merge pull request #8175 from keszybz/gc-cleanup
Garbage collection cleanup
2018-02-15 17:47:37 +01:00
Zbigniew Jędrzejewski-Szmek 7f7d01ed58 pid1: include the source unit in UnitRef
No functional change.

The source unit manages the reference. It allocates the UnitRef structure and
registers it in the target unit, and then the reference must be destroyed
before the source unit is destroyed. Thus, is should be OK to include the
pointer to the source unit, it should be live as long as the reference exists.

v2:
- rename refs to refs_by_target
2018-02-15 13:27:06 +01:00
Zbigniew Jędrzejewski-Szmek f2f725e5cc pid1: rename unit_check_gc to unit_may_gc
"check" is unclear: what is true, what is false? Let's rename to "can_gc" and
revert the return value ("positive" values are easier to grok).

v2:
- rename from unit_can_gc to unit_may_gc
2018-02-15 13:04:12 +01:00
Lennart Poettering 6592b9759c core: add new new bus call for migrating foreign processes to scope/service units
This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.

The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.

The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.

The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have now User= field).

Fixes: #3388
2018-02-12 11:34:00 +01:00
Lennart Poettering 1d9cc8768f cgroup: add a new "can_delegate" flag to the unit vtable, and set it for scope and service units only
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.

Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.

Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
2018-02-12 11:34:00 +01:00
Lennart Poettering 81e9871e87 selinux: make sure we never use /dev/null for making unit selinux access decisions 2018-01-31 19:54:25 +01:00
Lennart Poettering adefcf2821 core: rework how we count the n_on_console counter
Let's add a per-unit boolean that tells us whether our unit is currently
counted or not. This way it's unlikely we get out of sync again and
things are generally more robust.

This also allows us to remove the counting logic specific to service
units (which was in fact mostly a copy from the generic implementation),
in favour of fully generic code.

Replaces: #7824
2018-01-24 20:14:51 +01:00
Lennart Poettering bb2c768545 core: add a new unit_needs_console() call
This call determines whether a specific unit currently needs access to
the console. It's a fancy wrapper around
exec_context_may_touch_console() ultimately, however for service units
we'll explicitly exclude the SERVICE_EXITED state from when we report
true.
2018-01-24 19:54:26 +01:00
Lennart Poettering 62a769136d core: rework how we track which PIDs to watch for a unit
Previously, we'd maintain two hashmaps keyed by PIDs, pointing to Unit
interested in SIGCHLD events for them. This scheme allowed a specific
PID to be watched by exactly 0, 1 or 2 units.

With this rework this is replaced by a single hashmap which is primarily
keyed by the PID and points to a Unit interested in it. However, it
optionally also keyed by the negated PID, in which case it points to a
NULL terminated array of additional Unit objects also interested. This
scheme means arbitrary numbers of Units may now watch the same PID.

Runtime and memory behaviour should not be impact by this change, as for
the common case (i.e. each PID only watched by a single unit) behaviour
stays the same, but for the uncommon case (a PID watched by more than
one unit) we only pay with a single additional memory allocation for the
array.

Why this all? Primarily, because allowing exactly two units to watch a
specific PID is not sufficient for some niche cases, as processes can
belong to more than one unit these days:

1. sd_notify() with MAINPID= can be used to attach a process from a
   different cgroup to multiple units.

2. Similar, the PIDFile= setting in unit files can be used for similar
   setups,

3. By creating a scope unit a main process of a service may join a
   different unit, too.

4. On cgroupsv1 we frequently end up watching all processes remaining in
   a scope, and if a process opens lots of scopes one after the other it
   might thus end up being watch by many of them.

This patch hence removes the 2-unit-per-PID limit. It also makes a
couple of other changes, some of them quite relevant:

- manager_get_unit_by_pid() (and the bus call wrapping it) when there's
  ambiguity will prefer returning the Unit the process belongs to based on
  cgroup membership, and only check the watch-pids hashmap if that
  fails. This change in logic is probably more in line with what people
  expect and makes things more stable as each process can belong to
  exactly one cgroup only.

- Every SIGCHLD event is now dispatched to all units interested in its
  PID. Previously, there was some magic conditionalization: the SIGCHLD
  would only be dispatched to the unit if it was only interested in a
  single PID only, or the PID belonged to the control or main PID or we
  didn't dispatch a signle SIGCHLD to the unit in the current event loop
  iteration yet. These rules were quite arbitrary and also redundant as
  the the per-unit handlers would filter the PIDs anyway a second time.
  With this change we'll hence relax the rules: all we do now is
  dispatch every SIGCHLD event exactly once to each unit interested in
  it, and it's up to the unit to then use or ignore this. We use a
  generation counter in the unit to ensure that we only invoke the unit
  handler once for each event, protecting us from confusion if a unit is
  both associated with a specific PID through cgroup membership and
  through the "watch_pids" logic. It also protects us from being
  confused if the "watch_pids" hashmap is altered while we are
  dispatching to it (which is a very likely case).

- sd_notify() message dispatching has been reworked to be very similar
  to SIGCHLD handling now. A generation counter is used for dispatching
  as well.

This also adds a new test that validates that "watch_pid" registration
and unregstration works correctly.
2018-01-23 21:29:31 +01:00
Alan Jenkins 25cd49647c mount: forbid mount on path with symlinks
It was forbidden to create mount units for a symlink.  But the reason is
that the mount unit needs to know the real path that will appear in
/proc/self/mountinfo.  The kernel dereferences *all* the symlinks in the
path at mount time (I checked this with `mount -c` running under `strace`).

This will have no effect on most systems.  As recommended by docs, most
systems use /etc/fstab, as opposed to native mount unit files.
fstab-generator dereferences symlinks for backwards compatibility.

A relatively minor issue regarding Time Of Check / Time Of Use also exists
here.  I can't see how to get rid of it entirely.  If we pass an absolute
path to mount, the racing process can replace it with a symlink.  If we
chdir() to the mount point and pass ".", the racing process can move the
directory.  The latter might potentially be nicer, except that it breaks
WorkingDirectory=.

I'm not saying the race is relevant to security - I just want to consider
how bad the effect is.  Currently, it can make the mount unit active (and
hence the job return success), despite there never being a matching entry
in /proc/self/mountinfo.  This wart will be removed in the next commit;
i.e. it will make the mount unit fail instead.
2018-01-20 22:06:34 +00:00
Lennart Poettering db256aab13 core: be stricter when handling PID files and MAINPID sd_notify() messages
Let's be more restrictive when validating PID files and MAINPID=
messages: don't accept PIDs that make no sense, and if the configuration
source is not trusted, don't accept out-of-cgroup PIDs. A configuratin
source is considered trusted when the PID file is owned by root, or the
message was received from root.

This should lock things down a bit, in case service authors write out
PID files from unprivileged code or use NotifyAccess=all with
unprivileged code. Note that doing so was always problematic, just now
it's a bit less problematic.

When we open the PID file we'll now use the CHASE_SAFE chase_symlinks()
logic, to ensure that we won't follow an unpriviled-owned symlink to a
privileged-owned file thinking this was a valid privileged PID file,
even though it really isn't.

Fixes: #6632
2018-01-11 15:12:16 +01:00
Lennart Poettering 4c253ed1ca tree-wide: introduce new safe_fork() helper and port everything over
This adds a new safe_fork() wrapper around fork() and makes use of it
everywhere. The new wrapper does a couple of things we previously did
manually and separately in a safer, more correct and automatic way:

1. Optionally resets signal handlers/mask in the child

2. Sets a name on all processes we fork off right after forking off (and
   the patch assigns useful names for all processes we fork off now,
   following a systematic naming scheme: always enclosed in () – in order
   to indicate that these are not proper, exec()ed processes, but only
   forked off children, and if the process is long-running with only our
   own code, without execve()'ing something else, it gets am "sd-" prefix.)

3. Optionally closes all file descriptors in the child

4. Optionally sets a PR_SET_DEATHSIG to SIGTERM in the child, in a safe
   way so that the parent dying before this happens being handled
   safely.

5. Optionally reopens the logs

6. Optionally connects stdin/stdout/stderr to /dev/null

7. Debug logs about the forked off processes.
2017-12-25 11:48:21 +01:00
Michal Koutný deb4e7080d service: Don't stop unneeded units needed by restarted service (#7526)
An auto-restarted unit B may depend on unit A with StopWhenUnneeded=yes.
If A stops before B's restart timeout expires, it'll be started again as part
of B's dependent jobs. However, if stopping takes longer than the timeout, B's
running stop job collides start job which also cancels B's start job. Result is
that neither A or B are active.

Currently, when a service with automatic restarting fails, it transitions
through following states:
        1) SERVICE_FAILED or SERVICE_DEAD to indicate the failure,
        2) SERVICE_AUTO_RESTART while restart timer is running.

The StopWhenUnneeded= check takes place in service_enter_dead between the two
state mentioned above. We temporarily store the auto restart flag to query it
during the check. Because we don't return control to the main event loop, this
new service unit flag needn't be serialized.

This patch prevents the pathologic situation when the service with Restart=
won't restart automatically. As a side effect it also avoid restarting the
dependency unit with StopWhenUnneeded=yes.

Fixes: #7377
2017-12-05 16:51:19 +01:00
Lennart Poettering 2e59b241ca core: add proper escaping to writing of drop-ins/transient unit files
This majorly refactors the transient unit file and drop-in writing
logic, so that we properly C-escape and specifier-escape (% → %%)
everything we write out, so that when we read it back again, specifiers
are parsed that aren't supposed to be parsed.

This renames unit_write_drop_in() and friends by unit_write_setting().
The name change is supposed to clarify that the functions are not only
used to write drop-in files, but also transient unit files.

The previous "mode" parameter to this function is replaced by a more
generic "flags", which knows additional flags for implicit C-style and
specifier escaping before writing things out. This can cover most
properties where either form of escaping is defined. For the cases where
this isn't sufficient, we add helpers unit_escape_setting() and
unit_concat_strv() for escaping individual strings or strvs properly.

While we are at it, we also prettify generation of transient unit files:
we try to reduce the number of section headers written out: previously
we'd write the right section header our for each setting. With this
change we do so only if the setting lives in a different section than
the one before.

(This should also be considered preparation for when we add proper APIs
to systemd to write normal, persistant unit files through the bus API)
2017-11-29 12:34:12 +01:00
Lennart Poettering a4634b214c core: warn about left-over processes in cgroup on unit start
Now that we don't kill control processes anymore, let's at least warn
about any processes left-over in the unit cgroup at the moment of
starting the unit.
2017-11-25 17:08:21 +01:00
Zbigniew Jędrzejewski-Szmek ffb70e4424
Merge pull request #7381 from poettering/cgroup-unified-delegate-rework
Fix delegation in the unified hierarchy + more cgroup work
2017-11-22 07:42:08 +01:00
Lennart Poettering 3c7416b6ca core: unify common code for preparing for forking off unit processes
This introduces a new function unit_prepare_exec() that encapsulates a
number of calls we do in preparation for spawning off some processes in
all our unit types that do so.

This allows us to neatly unify a bit of code between unit types and
shorten our code.
2017-11-21 11:54:08 +01:00
Lennart Poettering e7dfbb4e74 core: introduce SuccessAction= as unit file property
SuccessAction= is similar to FailureAction= but declares what to do on
success of a unit, rather than on failure. This is useful for running
commands in qemu/nspawn images, that shall power down on completion. We
frequently see "ExecStopPost=/usr/bin/systemctl poweroff" or so in unit
files like this. Offer a simple, more declarative alternative for this.

While we are at it, hook up failure action with unit_dump() and
transient units too.
2017-11-20 16:37:22 +01:00
Lennart Poettering 53c35a766f core: generalize FailureAction= move it from service to unit
All kinds of units can fail, hence it makes sense to offer this as
generic concept for all unit types.
2017-11-20 16:37:22 +01:00
Zbigniew Jędrzejewski-Szmek 53e1b68390 Add SPDX license identifiers to source files under the LGPL
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
2017-11-19 19:08:15 +01:00
Lennart Poettering 5afe510c89 core: add a new unit file setting CollectMode= for tweaking the GC logic
Right now, the option only takes one of two possible values "inactive"
or "inactive-or-failed", the former being the default, and exposing same
behaviour as the status quo ante. If set to "inactive-or-failed" units
may be collected by the GC logic when in the "failed" state too.

This logic should be a nicer alternative to using the "-" modifier for
ExecStart= and friends, as the exit data is collected and logged about
and only removed when the GC comes along. This should be useful in
particular for per-connection socket-activated services, as well as
"systemd-run" command lines that shall leave no artifacts in the
system.

I was thinking about whether to expose this as a boolean, but opted for
an enum instead, as I have the suspicion other tweaks like this might be
a added later on, in which case we extend this setting instead of having
to add yet another one.

Also, let's add some documentation for the GC logic.
2017-11-16 14:38:36 +01:00
Lennart Poettering 7eb2a8a125 unit: rework a bit how we keep the service fdstore from being destroyed during service restart
When preparing for a restart we quickly go through the DEAD/INACTIVE
service state before entering AUTO_RESTART. When doing this, we need to
make sure we don't destroy the FD store. Previously this was done by
checking the failure state of the unit, and keeping the FD store around
when the unit failed, under the assumption that the restart logic will
then get into action.

This is not entirely correct howver, as there might be failure states
that will no result in restarts.

With this commit we slightly alter the logic: a ref counter for the fd
store is added, that is increased right before we handle the restart
logic, and decreased again right-after.

This should ensure that the fdstore lives exactly as long as it needs.

Follow-up for f0bfbfac43.
2017-11-16 14:37:33 +01:00
Lennart Poettering d3070fbdf6 core: implement /run/systemd/units/-based path for passing unit info from PID 1 to journald
And let's make use of it to implement two new unit settings with it:

1. LogLevelMax= is a new per-unit setting that may be used to configure
   log priority filtering: set it to LogLevelMax=notice and only
   messages of level "notice" and lower (i.e. more important) will be
   processed, all others are dropped.

2. LogExtraFields= is a new per-unit setting for configuring per-unit
   journal fields, that are implicitly included in every log record
   generated by the unit's processes. It takes field/value pairs in the
   form of FOO=BAR.

Also, related to this, one exisiting unit setting is ported to this new
facility:

3. The invocation ID is now pulled from /run/systemd/units/ instead of
   cgroupfs xattrs. This substantially relaxes requirements of systemd
   on the kernel version and the privileges it runs with (specifically,
   cgroupfs xattrs are not available in containers, since they are
   stored in kernel memory, and hence are unsafe to permit to lesser
   privileged code).

/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.

Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.

This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.

Fixes: #4089
Fixes: #3041
Fixes: #4441
2017-11-16 12:40:17 +01:00
Lennart Poettering c999cf385a core: add internal API to remove dependencies again, based on dependency mask
let's make use of the dependency mask, and add internal API to remove
dependencies ago, based on bits in the dependency mask.
2017-11-10 19:45:29 +01:00
Lennart Poettering eef85c4a3f core: track why unit dependencies came to be
This replaces the dependencies Set* objects by Hashmap* objects, where
the key is the depending Unit, and the value is a bitmask encoding why
the specific dependency was created.

The bitmask contains a number of different, defined bits, that indicate
why dependencies exist, for example whether they are created due to
explicitly configured deps in files, by udev rules or implicitly.

Note that memory usage is not increased by this change, even though we
store more information, as we manage to encode the bit mask inside the
value pointer each Hashmap entry contains.

Why this all? When we know how a dependency came to be, we can update
dependencies correctly when a configuration source changes but others
are left unaltered. Specifically:

1. We can fix UDEV_WANTS dependency generation: so far we kept adding
   dependencies configured that way, but if a device lost such a
   dependency we couldn't them again as there was no scheme for removing
   of dependencies in place.

2. We can implement "pin-pointed" reload of unit files. If we know what
   dependencies were created as result of configuration in a unit file,
   then we know what to flush out when we want to reload it.

3. It's useful for debugging: "systemd-analyze dump" now shows
   this information, helping substantially with understanding how
   systemd's dependency tree came to be the way it came to be.
2017-11-10 19:45:29 +01:00
Lennart Poettering eae51da36e unit: when JobTimeoutSec= is turned off, implicitly turn off JobRunningTimeoutSec= too
We added JobRunningTimeoutSec= late, and Dracut configured only
JobTimeoutSec= to turn of root device timeouts before. With this change
we'll propagate a reset of JobTimeoutSec= into JobRunningTimeoutSec=,
but only if the latter wasn't set explicitly.

This should restore compatibility with older systemd versions.

Fixes: #6402
2017-10-05 13:06:44 +02:00
Andreas Rammhold 3742095b27
tree-wide: use IN_SET where possible
In addition to the changes from #6933 this handles cases that could be
matched with the included cocci file.
2017-10-02 13:09:54 +02:00
Lennart Poettering 09e2465407 cgroup: after determining that a cgroup is empty, asynchronously dispatch this
This makes sure that if we learn via inotify or another event source
that a cgroup is empty, and we checked that this is indeed the case (as
we might get spurious notifications through inotify, as the inotify
logic through the "cgroups.event" is pretty unspecific and might be
trigger for a variety of reasons), then we'll enqueue a defer event for
it, at a priority lower than SIGCHLD handling, so that we know for sure
that if there's waitid() data for a process we used it before
considering the cgroup empty notification.

Fixes: #6608
2017-09-27 18:26:18 +02:00
Lennart Poettering 91a6073ef7 core: rename cgroup_queue → cgroup_realize_queue
We are about to add second cgroup-related queue, called
"cgroup_empty_queue", hence let's rename "cgroup_queue" to
"cgroup_realize_queue" (as that is its purpose) to minimize confusion
about the two queues.

Just a rename, no functional changes.
2017-09-27 17:59:25 +02:00
Lennart Poettering 6d330fef4d unit: remove unused fields from Unit structure 2017-09-27 17:59:25 +02:00
Lennart Poettering f1c50becda core: make sure to log invocation ID of units also when doing structured logging 2017-09-22 15:24:55 +02:00
Lennart Poettering 6b659ed87e core: serialize/deserialize IP accounting across daemon reload/reexec
Make sure the current IP accounting counters aren't lost during
reload/reexec.

Note that we destroy all BPF file objects during a reload: the BPF
programs, the access and the accounting maps. The former two need to be
regenerated anyway with the newly loaded configuration data, but the
latter one needs to survive reloads/reexec. In this implementation I
opted to only save/restore the accounting map content instead of the map
itself. While this opens a (theoretic) window where IP traffic is still
accounted to the old map after we read it out, and we thus miss a few
bytes this has the benefit that we can alter the map layout between
versions should the need arise.
2017-09-22 15:24:55 +02:00
Lennart Poettering a79279c7fd core: when creating the socket fds for a socket unit, join socket's cgroup first
Let's make sure that a socket unit's IPAddressAllow=/IPAddressDeny=
settings are in effect on all socket fds associated with it. In order to
make this happen we need to make sure the cgroup the fds are associated
with are the socket unit's cgroup. The only way to do that is invoking
socket()+accept() in them. Since we really don't want to migrate PID 1
around we do this by forking off a helper process, which invokes
socket()/accept() and sends the newly created fd to PID 1. Ugly, but
works, and there's apparently no better way right now.

This generalizes forking off per-unit helper processes in a new function
unit_fork_helper_process(), which is then also used by the NSS chown()
code of socket units.
2017-09-22 15:24:55 +02:00
Daniel Mack 906c06f64a cgroup, unit, fragment parser: make use of new firewall functions 2017-09-22 15:24:55 +02:00