Running systemctl enable/disable/set-default/... with the --root
option under strace reveals that it accessed various files and
directories in the main fs, and not underneath the specified root.
This can lead to correct results only when the layout and
configuration in the container are identical, which often is not the
case. Fix this by adding the specified root to all file access
operations.
This patch does not handle some corner cases: symlinks which point
outside of the specified root might be interpreted differently than
they would be by the kernel if the specified root was the real root.
But systemctl does not create such symlinks by itself, and I think
this is enough of a corner case not to be worth the additional
complexity of reimplementing link chasing in systemd.
Also, simplify the code in a few places and remove an hypothetical
memory leak on error.
safe_close_pair() is more like safe_close(), except that it handles
pairs of fds, and doesn't make and misleading allusion, as it works
similarly well for socketpairs() as for pipe()s...
A service with PrivateNetwork= cannot access abstract namespace sockets
of the host anymore, hence let's better not use abstract namespace
sockets for this, since we want to make sure that PrivateNetwork=
is useful and doesn't break sd_notify().
safe_close() automatically becomes a NOP when a negative fd is passed,
and returns -1 unconditionally. This makes it easy to write lines like
this:
fd = safe_close(fd);
Which will close an fd if it is open, and reset the fd variable
correctly.
By making use of this new scheme we can drop a > 200 lines of code that
was required to test for non-negative fds or to reset the closed fd
variable afterwards.
The system state knows the states starting →
running/degraded/maintenance → stopping, where:
starting = system startup
running = normal operation
degraded = at least one unit is currently in failed state
maintenance = rescue/emergency mode is active or queued
stopping = system shutdown
When the manager receives a SIGUSR2 signal, it opens a memory stream
with open_memstream(), uses the returned file handle for logging, and
dumps the logged content with log_dump().
However, the char* buffer is only safe to use after the file handle has
been flushed with fflush, as the man pages states:
When the stream is closed (fclose(3)) or flushed (fflush(3)), the
locations pointed to by ptr and sizeloc are updated to contain,
respectively, a pointer to the buffer and the current size of the
buffer.
These values remain valid only as long as the caller performs no
further output on the stream. If further output is performed, then the
stream must again be flushed before trying to access these variables.
Without that call, dump remains NULL and the daemon crashes in
log_dump().
This is primarily useful for services that need to track clients which
reference certain objects they maintain, or which explicitly want to
subscribe to certain events. Something like this is done in a large
number of services, and not trivial to do. Hence, let's unify this at
one place.
This also ports over PID 1 to use this to ensure that subscriptions to
job and manager events are correctly tracked. As a side-effect this
makes sure we properly serialize and restore the track list across
daemon reexec/reload, which didn't work correctly before.
This also simplifies how we distribute messages to broadcast to the
direct busses: we only track subscriptions for the API bus and
implicitly assume that all direct busses are subscribed. This should be
a pretty OK simplification since clients connected via direct bus
connections are shortlived anyway.
Previously the returned object of constructor functions where sometimes
returned as last, sometimes as first and sometimes as second parameter.
Let's clean this up a bit. Here are the new rules:
1. The object the new object is derived from is put first, if there is any
2. The object we are creating will be returned in the next arguments
3. This is followed by any additional arguments
Rationale:
For functions that operate on an object we always put that object first.
Constructors should probably not be too different in this regard. Also,
if the additional parameters might want to use varargs which suggests to
put them last.
Note that this new scheme only applies to constructor functions, not to
all other functions. We do give a lot of freedom for those.
Note that this commit only changes the order of the new functions we
added, for old ones we accept the wrong order and leave it like that.
In some cases it is interesting to map a PID to two units at the same
time. For example, when a user logs in via a getty, which is reexeced to
/sbin/login that binary will be explicitly referenced as main pid of the
getty service, as well as implicitly referenced as part of the session
scope.
When a process dies that we can associate with a specific unit, start
watching all other processes of that unit, so that we can associate
those processes with the unit too.
Also, for service units start doing this as soon as we get the first
SIGCHLD for either control or main process, so that we can follow the
processes of the service from one to the other, as long as process that
remain are processes of the ones we watched that died and got reassigned
to us as parent.
Similar, for scope units start doing this as soon as the scope
controller abandons the unit, and thus management entirely reverts to
systemd. To abandon a unit introduce a new Abandon() scope unit method
call.
This reverts commit 28c758de94
but makes job_coldplug smarter.
In (v1) I changed the job start timestamp to be always set, so the
start time can be reported in the cylon eye message. The bug was that
when deserializing jobs, they would be ignored if their start
timestamp was unset which was synonymous with no timeout. But after
the change, jobs would have a start timestamp set despite having no
timeout. After deserialization they would be considered immediately
expired. Fix this by checking if the timeout is not zero when
considering jobs for expiration.
When set to auto, status will shown when the first ephemeral message
is shown (a job has been running for five seconds). Then until the
boot or shutdown ends, status messages will be shown.
No indication about the switch is done: I think it should be clear
for the user that first the cylon eye and the ephemeral messages appear,
and afterwards messages are displayed.
The initial arming of the event source was still wrong, but now should
really be fixed.
It would fire just once.
Also fix units from sec to usec as appropriate.
Decrease the switching interval to 1/3 s, so that when the time
remaining is displayed with 1s precision, it doesn't jump by 2s every
once in a while. Also, the system is feels noticably faster when the
status changes couple of times per second instead of every few
seconds.
Produces output like:
[ *** ] (1 of 2) A start job is running for slow.service (33s / 1min 30s)
The first nubmer is the time since job start, the second is the job timeout.
It is nicer to predefine patterns using configure time check instead of
using casts everywhere.
Since we do not need to use any flags, include "%" in the format instead
of excluding it like PRI* macros.
Information about signals which are not routinely received by systemd
are printed at info level. This should make it easier to see what is
happening in the system.
The only problem is that libgen.h #defines basename to point to it's
own broken implementation instead of the GNU one. This can be fixed
by #undefining basename.
Since we want to retain the ability to break kernel ←→ userspace ABI
after the next release, let's not make use by default of kdbus, so that
people with future kernels will not suddenly break with current systemd
versions.
kdbus support is left in all builds but must now be explicitly requested
at runtime (for example via setting $DBUS_SESSION_BUS). Via a configure
switch the old behaviour can be restored. In fact, we change autogen.sh
to do this, so that git builds (which run autogen.sh) get kdbus by
default, but tarball builds (which ue the configure defaults) do not get
it, and hence this stays out of the distros by default.
This way, we can avoid executing two /bin/swapon jobs to be dispatched
for the same swap device if it is configured for two different paths.
Previously we were just tracking the device nodes of active swap
devices, which would not allow us to recognize the identity of two swap
devices before they are active.
https://bugs.freedesktop.org/show_bug.cgi?id=69835
Given that plymouth listens on an abstract namespace socket and if
CLONE_NEWNET is not used the abstract namespace is shared with the host
we might actually end up send plymouth data to the host.
This patch converts PID 1 to libsystemd-bus and thus drops the
dependency on libdbus. The only remaining code using libdbus is a test
case that validates our bus marshalling against libdbus' marshalling,
and this dependency can be turned off.
This patch also adds a couple of things to libsystem-bus, that are
necessary to make the port work:
- Synthesizing of "Disconnected" messages when bus connections are
severed.
- Support for attaching multiple vtables for the same interface on the
same path.
This patch also fixes the SetDefaultTarget() and GetDefaultTarget() bus
calls which used an inappropriate signature.
As a side effect we will now generate PropertiesChanged messages which
carry property contents, rather than just invalidation information.
$ touch src/core/dbus.c; make CFLAGS=-O0
make --no-print-directory all-recursive
Making all in .
CC src/core/libsystemd_core_la-dbus.lo
CCLD libsystemd-core.la
$ touch src/core/dbus.c; make CFLAGS=-Og
make --no-print-directory all-recursive
Making all in .
CC src/core/libsystemd_core_la-dbus.lo
src/core/dbus.c: In function 'init_registered_system_bus':
src/core/dbus.c:798:18: warning: 'id' may be used uninitialized in this function [-Wmaybe-uninitialized]
dbus_free(id);
^
CCLD libsystemd-core.la
-Og Optimize debugging experience. -Og enables optimizations that do
not interfere with debugging. It should be the optimization level of
choice for the standard edit-compile-debug cycle, offering a
reasonable level of optimization while maintaining fast compilation
and a good debugging experience.
Also, we need to use proper strv_env_xyz() calls when putting together
the environment array, since otherwise settings won't be properly
overriden.
And let's get rid of strv_appendf(), is overkill and there was only one
user.
Previously to automatically create dependencies between mount units we
matched every mount unit agains all others resulting in O(n^2)
complexity. On setups with large amounts of mount units this might make
things slow.
This change replaces the matching code to use a hashtable that is keyed
by a path prefix, and points to a set of units that require that path to
be around. When a new mount unit is installed it is hence sufficient to
simply look up this set of units via its own file system paths to know
which units to order after itself.
This patch also changes all unit types to only create automatic mount
dependencies via the RequiresMountsFor= logic, and this is exposed to
the outside to make things more transparent.
With this change we still have some O(n) complexities in place when
handling mounts, but that's currently unavoidable due to kernel APIs,
and still substantially better than O(n^2) as before.
https://bugs.freedesktop.org/show_bug.cgi?id=69740
Prefer firmware-provided performance data over loader-exported ones; if
ACPI data is available, always use it, otherwise try to read the loader
data.
The firmware-provided variables start at the time the first EFI image
is executed and end when the operating system exits the boot services;
the (loader) time calculated in systemd-analyze increases.
Stop importing non-sensical kernel-exported variables. All
parameters in the kernel command line are exported to the
initial environment of PID1, but suppressed if they are
recognized by kernel built-in code. The EFI booted kernel
will add further kernel-internal things which do not belong
into userspace.
The passed original environ data of the process is not touched
and preserved across re-execution, to allow external reading of
/proc/self/environ for process properties like container*=.
Make Type=idle communication bidirectional: when bootup is finished,
the manager, as before, signals idling Type=idle jobs to continue.
However, if the boot takes too long, idling jobs signal the manager
that they have had enough, wait a tiny bit more, and continue, taking
ownership of the console. The manager, when signalled that Type=idle
jobs are done, makes a note and will not write to the console anymore.
This is a cosmetic issue, but quite noticable, so let's just fix it.
Based on Harald Hoyer's patch.
https://bugs.freedesktop.org/show_bug.cgi?id=54247http://unix.stackexchange.com/questions/51805/systemd-messages-after-starting-login/
Since we'll unload all units/job during a reload, and then readd them it
is really useful for clients to be aware of this phase hence sent a
signal out before and after. This signal is called "Reloading" (despite
the fact that it is also sent out during reexecution, which we consider
a special case in this context) and has one boolean parameter which is
true for the signal sent before the reload, and false for the signal
after the reload. The UnitRemoved/JobRremoved and UnitNew/JobNew due to
the reloading are guranteed to be between the pair of Reloading
messages.
Since we should allow registering/unregistering transient units with the
same name in a tight-loop, we need to make the GC more aggressive, so
that dead units are cleaned up immediately instead of later.
hence, execute the GC sweep on every event loop iteration and clean up
units. This of course, means we need to be careful with adding units to
the GC queue, which we already are since we execute check_gc() of each
unit type already when adding something to the queue.
Transient units can be created via the bus API. They are configured via
the method call parameters rather than on-disk files. They are subject
to normal GC. Transient units currently may only be created for
services (however, we will extend this), and currently only ExecStart=
and the cgroup parameters can be configured (also to be extended).
Transient units require a unique name, that previously had no
configuration file on disk.
A tool systemd-run is added that makes use of this functionality to run
arbitrary command lines as transient services:
$ systemd-run /bin/ping www.heise.de
Will cause systemd to create a new transient service and run ping in it.
Replace the very generic cgroup hookup with a much simpler one. With
this change only the high-level cgroup settings remain, the ability to
set arbitrary cgroup attributes is removed, so is support for adding
units to arbitrary cgroup controllers or setting arbitrary paths for
them (especially paths that are different for the various controllers).
This also introduces a new -.slice root slice, that is the parent of
system.slice and friends. This enables easy admin configuration of
root-level cgrouo properties.
This replaces DeviceDeny= by DevicePolicy=, and implicitly adds in
/dev/null, /dev/zero and friends if DeviceAllow= is used (unless this is
turned off by DevicePolicy=).
This complements existing functionality of setting variables
through 'systemctl set-environment', the kernel command line,
and through normal environment variables for systemd in session
mode.
This will add another color to the legend called "Loading unit files"
Like the generators it will mark a part of the systemd bar indicating
the time spent while loading unit files.
This partially reverts commit c3a170f3, which moved
efi_get_boot_timestamps too early in main(), before
/sys is assured to be mounted
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=64371
[tomegun: in particular /sys/firmware/efi/efivars needs to be
mounted, which is not a problem if a systemd-initramfs containing
the correct module is being used. But not everyone uses an
initramfs...]
When a trigger unit wants to know if a stop is queued for it, we should
just check precisely that and do not check whether it is actually
stopped already. This is because we use these checks usually from state
change calls where the state variables are not updated yet.
This change splits unit_pending_inactive() into two calls
unit_inactive_or_pending() and unit_stop_pending(). The former checks
state and pending jobs, the latter only pending jobs.
The time for systemd initialization and selinux policy loading
is accounted to the initrd or the kernel, which is wrong.
Instead of:
Startup finished in 5.559s (firmware) + 36ms (loader) + 665ms (kernel) +
975ms (initrd) + 1.410s (userspace) = 8.647s
the more correct output is:
Startup finished in 5.559s (firmware) + 36ms (loader) + 665ms (kernel) +
475ms (initrd) + 1.910s (userspace) = 8.647s
They are irrelevant and misleading.
E.g. systemd-analyze:
Startup finished in 6d 4h 15min 32.330s (kernel) + 49ms 914us (userspace) = 6d 4h 15min 32.380s
becomes
Startup finished in 53.735ms (userspace) = 53.735ms
which looks much better :)
When switching root, i.e. LANG can be set to the locale of the initramfs
or "C", if it was unset. When systemd deserializes LANG in the real root
this would overwrite the setting previously gathered by locale_set().
To reproduce, boot with an initramfs without locale.conf or change
/etc/locale.conf to a different language than the initramfs and check a
daemon started by systemd:
$ tr "$\000" '\n' </proc/$(pidof sshd)/environ | grep LANG
LANG=C
To prevent that, serialization of environment variables is skipped, when
serializing for switching root.
https://bugzilla.redhat.com/show_bug.cgi?id=949525
Before, we would initialize many fields twice: first
by filling the structure with zeros, and then a second
time with the real values. We can let the compiler do
the job for us, avoiding one copy.
A downside of this patch is that text gets slightly
bigger. This is because all zero() calls are effectively
inlined:
$ size build/.libs/systemd
text data bss dec hex filename
before 897737 107300 2560 1007597 f5fed build/.libs/systemd
after 897873 107300 2560 1007733 f6075 build/.libs/systemd
… actually less than 1‰.
A few asserts that the parameter is not null had to be removed. I
don't think this changes much, because first, it is quite unlikely
for the assert to fail, and second, an immediate SEGV is almost as
good as an assert.