Systemd


Author	SHA1	Message	Date
Luca Boccassi	cda667722c	core: refresh unit cache when building a transaction if UNIT_NOT_FOUND When a command asks to load a unit directly and it is in state UNIT_NOT_FOUND, and the cache is outdated, we refresh it and attempt to load it again. Use the same logic when building up a transaction and a dependency in UNIT_NOT_FOUND state is encountered. Update the unit test to exercise this code path.	2020-07-07 10:09:24 +02:00
Franck Bui	43bba15ac8	pid1: rename manager_set_{show_status,watchdog}_overridden() into manager_override_{show_status,watchdog}() No functional change.	2020-06-11 12:00:32 +02:00
Franck Bui	44a419540e	pid1: rework handling of m->show_status The fact that m->show_status was serialized/deserialized made any further customisation of this setting via system.conf impossible. IOW the value was basically always locked unless it was changed via signals. This patch reworks the handling of m->show_status but also makes sure that if a new value was set via the signal API, then this value is kept and preserved across PID1 reexecution or reload. Note: this effectively means that once the value is set via the signal interface, it can be changed again only through the signal API.	2020-06-09 09:16:54 +02:00
Franck Bui	b406c6d128	pid1: make manager_deserialize_{uid,gid}_refs() static No functional change.	2020-05-19 15:48:54 +02:00
Franck Bui	80f605c807	pid1: make manager_serialize_{uid,gid}_refs() static No functional change.	2020-05-19 15:48:54 +02:00
Franck Bui	06a4eb0737	pid1: make manager_vacuum_{uid,gid}_refs() static No functional change.	2020-05-19 15:48:54 +02:00
Franck Bui	1addc46c8c	pid1: make manager_flip_auto_status() static No functional change.	2020-05-19 15:48:54 +02:00
Franck Bui	986935cf6a	pid1: update manager settings on reload too Most of the complexity of this patch is due to the fact that some manager settings (basically the watchdog properties) can be set at runtime and in this case the runtime values must be retained over daemon-reload or daemon-reexec. For consistency's sake, all watchdog properties now behave the same way, that is: - Values defined by config files can be overridden by writing the new value through their respective D-Bus properties. In this case, these values are preserved over reload/reexec until the special value '0' or USEC_INFINITY is written, which will then restore the last values loaded from the config files. If the restored value is '0' or 'USEC_INFINITY', the watchdogs will be disabled and the corresponding device will be closed. - Reading the properties from a user instance will return the USEC_INFINITY value as these properties are only meaningful for PID1. - Writing to one of the watchdog properties of a user instance will be a NOP. Fixes: #15453	2020-05-19 15:31:55 +02:00
Zbigniew Jędrzejewski-Szmek	5bcf34ebf3	pid1: when showing error status, do not switch to status=temporary We would flip to status=temporary mode on the first error, and then switch back to status=auto after the initial transaction was done. This isn't very useful, because usually all the messages about successfully started units are not related to the original failure. In fact, all those messages most likely cause the information about the primary error to scroll off screen. And if the user requested quiet boot, there's no reason to think that they care about those success messages. Also, when logging about dependency cycles, treat this similarly to a unit error and show the message even if the status is "soft disabled" (before, we wouldn't show it in that case).	2020-03-01 11:42:42 +01:00
Zbigniew Jędrzejewski-Szmek	7365a29670	pid1: when printing status message status, give reason	2020-03-01 11:42:19 +01:00
Lennart Poettering	19d22d433d	core: add user/group resolution varlink interface to PID 1	2020-01-15 15:28:55 +01:00
Lennart Poettering	fc67a943d9	core: drop initial ListNames() bus call from PID 1 Previously, when first connecting to the bus, we'd issue a ListNames() bus call to the driver to figure out which bus names are currently active. This information was then used to initialize the initial state for services that use BusName=. This change removes the whole code for this and replaces it with something vastly simpler. First of all, the ListNames() call was issued synchronously, which meant that if dbus was for some reason synchronously calling into PID 1, we'd deadlock. As it turns out there's now a good chance it does: the nss-systemd userdb hookup means that any user dbus-daemon resolves might result in a varlink call into PID 1, and dbus resolves quite a lot of users while parsing its policy. My original goal was to fix this deadlock. But as it turns out we don't need the ListNames() call at all anymore, since #12957 has been merged. That PR was supposed to fix a race where asynchronous installation of bus matches would cause us to miss the initial owner of a bus name when a service is first started. It fixed it (correctly) by enquiring with GetOwnerName() who currently owns the name, right after installing the match. But this means whenever we start watching a bus name we issue a GetOwnerName() for it anyway, and that means also when first connecting to the bus we don't need to issue ListNames() anymore since that just tells us the same info: which names are currently owned. Hence, let's drop ListNames() and instead make better use of the GetOwnerName() result: if it failed, the name is not owned. Also, while we are at it, let's simplify the unit's owner_name_changed() callback: let's drop the "old_owner" argument. We never used that besides logging, and it's hard to synthesize from just the return of a GetOwnerName(), hence don't bother.	2020-01-06 15:21:47 +01:00
Zbigniew Jędrzejewski-Szmek	3a0f06c41a	core: make TasksMax a partially dynamic property TasksMax= and DefaultTasksMax= can be specified as percentages. We don't actually document what the percentage is relative to, but the implementation uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max, and /sys/fs/cgroup/pids.max (when present). When the value is a percentage, we immediately convert it to an absolute value. If the limit later changes (which can happen e.g. when systemd-sysctl runs), the absolute value becomes outdated. So let's store either the percentage or absolute value, whatever was specified, and only convert to an absolute value when the value is used. For example, when starting a unit, the absolute value will be calculated when the cgroup for the unit is created. Fixes #13419.	2019-11-14 18:41:54 +01:00
Zbigniew Jędrzejewski-Szmek	6123dfaa72	pid1: disable printk ratelimit in early boot We have the problem that many early boot or late shutdown issues are harder to solve than they could be because we have no logs. When journald is not running, messages are redirected to /dev/kmsg. It is also the time when many things happen in rapid succession, so we tend to hit the kernel printk ratelimit fairly reliably. The end result is that we get no logs from the time where they would be most useful. Thus let's disable the kernel's ratelimit. Once the system is up and running, the ratelimit is not a problem. But during normal runtime, things also log to journald, and not to /dev/kmsg, so the ratelimit is not useful. Hence, there doesn't seem to be much point in trying to restore the ratelimit after boot is finished and journald is up and running. See kernel's commit 750afe7babd117daabebf4855da18e4418ea845e for the description of the kernel interface. Our setting has lower precedence than explicit configuration on the kernel command line.	2019-09-20 16:05:53 +02:00
Lennart Poettering	5756bff6f1	Merge pull request #13119 from keszybz/unit-loading-2 Rework unit loading to take into account all aliases	2019-07-30 17:55:37 +02:00
Zbigniew Jędrzejewski-Szmek	91e0ee5f16	pid1: drop unit caches only based on mtime v2: - do not watch mtime of transient and generated dirs We'd reload the map after every transient unit we created, which we don't need to do, since we create those units ourselves and know their fragment path.	2019-07-30 14:01:46 +02:00
Zbigniew Jędrzejewski-Szmek	e8630e6952	pid1: use a cache for all unit aliases This reworks how we load units from disk. Instead of chasing symlinks every time we are asked to load a unit by name, we slurp all symlinks from disk and build two hashmaps: 1. from unit name to either alias target, or fragment on disk (if an alias, we put just the target name in the hashmap, if a fragment we put an absolute path, so we can distinguish both). 2. from a unit name to all aliases Reading all this data can be pretty costly (40 ms on my machine), so we keep it around for reuse. The advantage is that we can reliably know what all the aliases of a given unit are. This means we can reliably load dropins under all names. This fixes #11972.	2019-07-30 14:01:46 +02:00
Luca Boccassi	65224c1d0e	core: rename ShutdownWatchdogSec to RebootWatchdogSec This option is only used on reboot, not on other types of shutdown modes, so it is misleading. Keep the old name working for backward compatibility, but remove it from the documentation.	2019-07-23 20:29:03 +01:00
Luca Boccassi	acafd7d8a6	core: add KExecWatchdogSec option Rather than always enabling the shutdown WD on kexec, which might be dangerous in case the kernel driver and/or the hardware implementation does not reset the wd on kexec, add a new timer, disabled by default, to let users optionally enable the shutdown WD on kexec separately from the runtime and reboot ones. Advise in the documentation to also use the runtime WD in conjunction with it. Fixes: `a637d0f9ec` ("core: set shutdown watchdog on kexec too")	2019-07-23 20:29:03 +01:00
Michael Olbrich	da8e178296	job: make the run queue order deterministic Jobs are added to the run queue in random order. This happens because most jobs are added by iterating over the transaction or dependency hash maps. As a result, jobs that can be executed at the same time are started in a different order each time. On small embedded devices this can cause a measurable jitter for the point in time when a job starts (~100ms jitter for 10 units that are started in random order). This results in a similar jitter in the boot time. This is undesirable in general and makes optimizing the boot time a lot harder. Also, jobs that should have a higher priority because the unit has a higher CPU weight might get executed later than others. Fix this by turning the job run_queue into a Prioq and sort by the following criteria (use the next if the values are equal): - CPU weight - nice level - unit type - unit name The last one is just there for deterministic sorting to avoid any jitter.	2019-07-18 10:28:39 +02:00
Zbigniew Jędrzejewski-Szmek	36cf45078c	Add config and kernel commandline option to use short identifiers No functional change, just docs and configuration and parsing. v2: - change ShortIdentifiers=yes|no to StatusUnitFormat=name|description.	2019-07-10 13:35:26 +02:00
Yu Watanabe	9c79f0e0a0	core: add assertion in two inline functions	2019-04-14 20:46:24 +09:00
Jan Klötzke	dc653bf487	service: handle abort stops with dedicated timeout When shooting down a service with SIGABRT the user might want to have a much longer stop timeout than on regular stops/shutdowns. Especially in the face of short stop timeouts the time might not be sufficient to write huge core dumps before the service is killed. This commit adds a dedicated (Default)TimeoutAbortSec= timer that is used when stopping a service via SIGABRT. In all other cases the existing TimeoutStopSec= is used. The timer value is unset by default to skip the special handling and use TimeoutStopSec= for state 'stop-watchdog' to keep the old behaviour. If the service is in state 'stop-watchdog' and the service should be stopped explicitly we still go to 'stop-sigterm' and re-apply the usual TimeoutStopSec= timeout.	2019-04-12 17:32:52 +02:00
Lennart Poettering	afcfaa695c	core: implement OOMPolicy= and watch cgroups for OOM killings This adds a new per-service OOMPolicy= (along with a global DefaultOOMPolicy=) that controls what to do if a process of the service is killed by the kernel's OOM killer. It has three different values: "continue" (old behaviour), "stop" (terminate the service), "kill" (let the kernel kill all the service's processes). On top of that, track OOM killer events per unit: generate a per-unit structured, recognizable log message when we see an OOM killer event, and put the service in a failure state if an OOM killer event was seen and the selected policy was not "continue". A new "result" is defined for this case: "oom-kill". All of this relies on new cgroupv2 kernel functionality: the "memory.events" notification interface and the "memory.oom.group" attribute (which makes the kernel kill all cgroup processes automatically).	2019-04-09 11:17:58 +02:00
Lennart Poettering	0bb814c2c2	core: rename cgroup_inotify_wd → cgroup_control_inotify_wd Let's rename the .cgroup_inotify_wd field of the Unit object to .cgroup_control_inotify_wd. Let's similarly rename the hashmap .cgroup_inotify_wd_unit of the Manager object to .cgroup_control_inotify_wd_unit. Why? As preparation for a later commit that allows us to watch the "memory.events" cgroup attribute file in addition to the "cgroup.events" file we already watch with the fields above. In that later commit we'll add new fields "cgroup_memory_inotify_wd" to Unit and "cgroup_memory_inotify_wd_unit" to Manager, that are used to watch this other events file. No change in behaviour. Just some renaming.	2019-04-09 11:17:57 +02:00
Zbigniew Jędrzejewski-Szmek	237ebf61e2	Merge pull request #12013 from yuwata/fix-switchroot-11997 core: on switching root do not emit device state change based on enumeration results	2019-04-02 16:06:07 +02:00
Lennart Poettering	50cbaba4fe	core: add new API for enqueuing a job and returning the transaction data	2019-03-27 12:37:37 +01:00
Franck Bui	f75f613d25	core: reduce the number of stale PIDs from the watched processes list when possible Some PIDs can remain in the watched list even though their processes exited a long time ago. It can easily happen if the main process of a forking service manages to spawn a child before the control process exits, for example. However, when a PID is about to be mapped to a unit by calling unit_watch_pid(), the caller usually knows if the PID should belong to this unit exclusively: if we just forked() off a child, then we can be sure that its PID is otherwise unused. In this case we take this opportunity to remove any stale PIDs from the watched process list. If we learnt about a PID in any other form (for example via PID file, via searching, MAINPID= and so on), then we can't assume anything.	2019-03-20 10:51:49 +01:00
Yu Watanabe	c6e892bc0e	core: add Manager::honor_device_enumeration flag When the system manager is started for the first time or after switching root, udev's device tag data does not exist yet. So, let's not honor the enumeration results. Fixes #11997.	2019-03-15 19:47:43 +09:00
Zbigniew Jędrzejewski-Szmek	ec8126d723	Revert "core/mount: minimize impact on mount storm." This reverts commit `89f9752ea0`. This patch causes various problems during boot, where a "mount storm" occurs naturally. The current approach is flaky, and it seems very risky to push a feature like this which impacts boot right before a release. So let's revert for now, and consider a more robust solution later. Fixes #11209. > https://github.com/systemd/systemd/pull/11196#issuecomment-448523186: "Reverting `89f9752ea0` and `fcfb1f775e` fixes this test."	2018-12-19 11:37:41 +01:00
Zbigniew Jędrzejewski-Szmek	e36db50075	Revert "mount: disable mount-storm protection while mount unit is starting." This reverts commit `fcfb1f775e`.	2018-12-19 11:32:17 +01:00
NeilBrown	fcfb1f775e	mount: disable mount-storm protection while mount unit is starting. The starting of mount units requires that changes to /proc/self/mountinfo be processed before the SIGCHLD from the completion of /sbin/mount is processed, as described by the comment `/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded * before we process the SIGCHLD for the mount command. */` The recently-added mount-storm protection can defeat this as it will sometimes deliberately delay processing of /proc/self/mountinfo. So we need to disable mount-storm protection when a mount unit is starting. We do this by keeping a counter of the number of pending mounts, and disabling the protection when this is non-zero. Thanks to @asavah for finding and reporting this problem.	2018-12-19 00:44:19 +01:00
NeilBrown	89f9752ea0	core/mount: minimize impact on mount storm. If we create 2000 mounts (on a 1-CPU qemu VM) with `mkdir -p /MNT/{1..2000}` and `time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done` it takes around 20 seconds to complete. Much of this time is taken up by systemd repeatedly processing /proc/self/mountinfo. If I disable the processing, the time drops to about 4 seconds. I have reports that on a larger system with multiple active user sessions, each with its own systemd, the impact can be higher. One particular use-case where a large number of mounts can be expected in quick succession is when the "clearcase" SCM starts up. This patch modifies the handling of events from /proc/self/mountinfo so that systemd backs off when a storm is detected. Specifically, the time to process mountinfo is measured, and the process will not be repeated until 10 times that duration has passed. This ensures systemd won't use more than 10% of real time processing mountinfo. With this patch, my test above takes about 5 seconds.	2018-12-16 12:38:40 +01:00
Lennart Poettering	4a53080be6	core: don't track jobs-finishing-during-reload explicitly Memory management is borked for this, and moreover this is unnecessary since `f0831ed2a0`, i.e. since coldplug() and catchup() are two different concepts: the former restoring the state from before a reload, the latter then adjusting it again to the actual status in effect after the reload. Fixes: #10716 Mostly reverts: #8803	2018-12-12 11:15:06 +01:00
Lennart Poettering	79a224c460	main: when reloading PID 1 let's reset the default environment Otherwise we keep collecting stuff from env generators, and we really shouldn't. This was working properly on reexec but not on reload, as for reexec we would always start fresh, but for reload would reuse the Manager object and hence its default environment set. Fixes: #10671	2018-11-19 13:01:19 +01:00
Lennart Poettering	3dafa6bc76	core: drop dbus queue recursion check We don't dispatch the queue recursively anymore, hence let's simplify things a bit. As pointed out by @fbuihuu: https://github.com/systemd/systemd/pull/10763#discussion_r233209550	2018-11-14 20:09:11 +01:00
Lennart Poettering	209de5256b	core: rename queued_message → pending_reload_message This field is only used for pending Reload() replies, hence let's rename it to be more descriptive and precise. No change in behaviour.	2018-11-13 11:59:06 +01:00
Lennart Poettering	1ad6e8b302	core: split environment block maintained by PID 1's Manager object in two This splits the "environment" field of Manager into two: transient_environment and client_environment. The former is generated from configuration files, kernel cmdline, environment generators. The latter is the one the user can control with "systemctl set-environment" and similar. Both sets are merged transparently whenever needed. Separating the two sets has the benefit that we can safely flush out the former while keeping the latter during daemon reload cycles, so that env var settings from env generators or configuration files do not accumulate, but dynamic API changes are kept around. Note that this change is not entirely transparent to users: if the user first uses "set-environment" to override a transient variable, and then uses "unset-environment" to unset it again, things will revert to the original transient variable now, while previously the variable was fully removed. This change in behaviour should not matter too much though, I figure. Fixes: #9972	2018-10-31 18:00:53 +01:00
Yu Watanabe	d0955f0091	core: replace udev_monitor by sd_device_monitor	2018-10-17 03:31:20 +09:00
Lennart Poettering	638cece45d	core: clean up test run flags Let's make them typesafe, and let's add a nice macro helper for checking if we are in a test run, which should make testing for this much easier to read for most cases.	2018-10-09 19:43:43 +02:00
Lennart Poettering	ed4ac965fa	manager: rework test flags set No reason to avoid bit 0. Also, fix some tests that pass "true" as flags value, which is just wrong.	2018-10-09 19:43:43 +02:00
Lennart Poettering	af41e5086d	core: rename ManagerExitCode → ManagerObjective "ExitCode" is a bit of a misnomer in two ways: it suggests this was about the "exit code" concept that exit()/waitid() deal with, but really isn't. Moreover, it's not even just about exiting either, but more often about reloading/reexecing or rebooting. Let's hence pick a new name for this that is a bit more correct. I initially thought about naming this the "state", but that'd be a misnomer too, as the value really encodes a "goal" more than a current state. Also we already have the externally visible ManagerState. No actual changes in behaviour, just the rename.	2018-10-09 19:43:43 +02:00
Lennart Poettering	899987456c	manager: add explanatory comment regarding ManagerState	2018-10-09 19:43:43 +02:00
Yu Watanabe	4366e598ae	core: replace udev_device by sd_device	2018-08-23 04:57:39 +09:00
Zbigniew Jędrzejewski-Szmek	00c4361878	Merge pull request #9853 from poettering/uneeded-queue rework StopWhenUnneeded=1 logic	2018-08-21 10:06:30 +02:00
Lennart Poettering	a3c1168ac2	core: rework StopWhenUnneeded= logic Previously, we'd act immediately on StopWhenUnneeded= when a unit state changes. With this rework we'll maintain a queue instead: whenever there's the chance that StopWhenUnneeded= might have an effect we enqueue the unit, and process it later when we have nothing better to do. This should make the implementation a bit more reliable, as the unit notify event cannot immediately enqueue tons of side-effect jobs that might contradict each other, but we do so only in a strictly ordered fashion, from the main event loop. This slightly changes the check when to consider a unit "unneeded". Previously, we'd assume that a unit in "deactivating" state could also be cleaned up. With this new logic we'll only consider units unneeded that are fully up and have no job queued. This means that whenever there's something pending for a unit we won't clean it up.	2018-08-10 16:19:01 +02:00
Yu Watanabe	7bc740f480	core: add comments about timestamps stored in manager	2018-08-06 22:21:05 +09:00
Yu Watanabe	d4ee7bd849	core: serialize/deserialize several timestamps on initrd in different names	2018-07-24 03:45:51 +09:00
Lennart Poettering	0c69794138	tree-wide: remove Lennart's copyright lines These lines are generally out-of-date, incomplete and unnecessary. With SPDX and git repository much more accurate and fine grained information about licensing and authorship is available, hence let's drop the per-file copyright notice. Of course, removing copyright lines of others is problematic, hence this commit only removes my own lines and leaves all others untouched. It might be nicer if sooner or later those could go away too, making git the only and accurate source of authorship information.	2018-06-14 10:20:20 +02:00
Lennart Poettering	818bf54632	tree-wide: drop 'This file is part of systemd' blurb This part of the copyright blurb stems from the GPL use recommendations: https://www.gnu.org/licenses/gpl-howto.en.html The concept appears to originate in times when version control was per file, instead of per tree, and was a way to glue the files together. Ultimately, we nowadays don't live in that world anymore, and this information is entirely useless anyway, as people are very welcome to copy these files into any projects they like, and they shouldn't have to change bits that are part of our copyright header for that. Hence, let's just get rid of this old cruft, and shorten our codebase a bit.	2018-06-14 10:20:20 +02:00
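
Commit `da8e178296` describes ordering the job run queue by CPU weight, then nice level, then unit type, then unit name, consulting each key only when the previous ones tie. A minimal C sketch of such a comparator follows. It assumes a hypothetical `JobKey` struct instead of systemd's real Job and Unit types, and the direction of each comparison (higher CPU weight and lower nice level run first) is also an assumption, since the message only names the criteria and their precedence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of what the run queue needs to know about a
 * job's unit. These names are illustrative, not systemd's actual types. */
typedef struct JobKey {
        uint64_t cpu_weight; /* higher weight runs earlier (assumption) */
        int nice;            /* lower nice level runs earlier (assumption) */
        int unit_type;       /* e.g. an enum value; lower sorts first */
        const char *name;    /* unit name, the final deterministic tie-breaker */
} JobKey;

/* Returns <0 if a should run before b, >0 if after, 0 if the keys are equal.
 * Each criterion is consulted only when all previous ones are equal. */
static int job_key_compare(const JobKey *a, const JobKey *b) {
        assert(a);
        assert(b);

        if (a->cpu_weight != b->cpu_weight)
                return a->cpu_weight > b->cpu_weight ? -1 : 1;

        if (a->nice != b->nice)
                return a->nice < b->nice ? -1 : 1;

        if (a->unit_type != b->unit_type)
                return a->unit_type < b->unit_type ? -1 : 1;

        /* Names are unique per unit, so this makes the order total. */
        return strcmp(a->name, b->name);
}
```

A comparator of this shape can back a priority queue or a plain `qsort()` call; the final `strcmp()` on the unit name is what keeps the start order identical from boot to boot when all other keys tie.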

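Commit `89f9752ea0` (later reverted by `ec8126d723`) backs off rescanning /proc/self/mountinfo by measuring how long a scan took and not repeating it until ten times that duration has passed. The sketch below models only that timing rule. The `StormGuard` type and both helper functions are hypothetical, and measuring the delay from the start of the scan, which caps processing at 10% of wall-clock time, is an assumption about where the interval begins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t usec_t;

/* Hypothetical rate-limit state; not the structure used in 89f9752ea0. */
typedef struct StormGuard {
        usec_t next_allowed; /* earliest time the next rescan may start */
} StormGuard;

/* Returns true if a rescan may run at time 'now'. */
static bool storm_guard_may_run(const StormGuard *g, usec_t now) {
        assert(g);
        return now >= g->next_allowed;
}

/* Record a rescan that started at 'start' and finished at 'end'. The next
 * rescan is held off until 10x the processing time has elapsed since 'start',
 * so at most about 10% of wall-clock time goes to processing mountinfo. */
static void storm_guard_record(StormGuard *g, usec_t start, usec_t end) {
        assert(g);
        assert(end >= start);
        g->next_allowed = start + 10 * (end - start);
}
```

The follow-up `fcfb1f775e` shows why a pure timing rule like this is risky: a mount unit that is starting needs fresh mountinfo before the SIGCHLD of /sbin/mount is processed, so that commit disabled the back-off while mounts are pending.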

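Commit `3a0f06c41a` keeps TasksMax= in the form it was specified and only converts a percentage to an absolute value when it is used, relative to the smallest of pid_max, threads-max and pids.max. The following sketch illustrates that deferred resolution. The `TasksMax` struct, the whole-percent scale, and passing the three limits as arguments instead of reading them from /proc and /sys are all simplifying assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation: the value is kept exactly as specified. */
typedef struct TasksMax {
        uint64_t value;  /* absolute task count, or a percentage */
        bool is_percent; /* true if value is a whole percentage 0..100 (assumption) */
} TasksMax;

static uint64_t min_u64(uint64_t a, uint64_t b) {
        return a < b ? a : b;
}

/* Convert to an absolute limit at the moment it is needed. The caller passes
 * the current kernel limits; pids_max is UINT64_MAX when not present. */
static uint64_t tasks_max_resolve(const TasksMax *t,
                                  uint64_t pid_max,
                                  uint64_t threads_max,
                                  uint64_t pids_max) {
        assert(t);

        if (!t->is_percent)
                return t->value;

        assert(t->value <= 100);

        uint64_t limit = min_u64(min_u64(pid_max, threads_max), pids_max);

        /* Equals floor(limit * value / 100), split so the product cannot overflow. */
        return limit / 100 * t->value + limit % 100 * t->value / 100;
}
```

Because only the percentage is stored, a later change to one of the limits (for example by systemd-sysctl) is picked up the next time the value is resolved.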
216 Commits