The ret_size result is a bit of an awkward optimization that in a
sense enables bypassing the mmap-cache API, while encouraging
duplication of logic it already implements.
It's only utilized in one place; journal_file_move_to_object(),
apparently to avoid the overhead of remapping the whole object
again once its header, and thus its actual size, is known.
With mmap-cache's context cache, the overhead of simply
re-getting the object with the now known size should already be
negligible. So it's not clear what benefit this brings, unless
avoiding some function calls that do very little in the hot
context-cache hit case is of such a priority.
There's value in having all object-sized gets pass through
mmap_cache_get(), as it provides a single entrypoint for
instrumentation in profiling/statistics gathering. When
journal_file_move_to_object() bypasses getting the full object
size, you don't capture the full picture on the mmap-cache side
in terms of object sizes explicitly loaded from a journal file.
I'd like to see additional accounting in mmap_cache_get() in a
future commit, taking advantage of this change.
There are no mmap_cache_get() users that actually deviate prot
from the JournalFile's f->prot.
So there's no point in making this a separate parameter to
mmap_cache_get(), nor is there any need to store it in
JournalFile's f->prot.
Instead just pass it to mmap_cache_add_fd() at MMapFileDescriptor
creation, storing it in there for the mmap() callers, which
already receive MMapFileDescriptor *.
For functions receiving both an MMapFileDescriptor * and prot,
the prot argument has been simply removed and call sites updated.
Formalizing this fd:prot binding at the public API also enables
discarding the prot check in window_matches(), which is a hot
function on long window lists, so a minor CPU efficiency gain
should be had there as seen with the past removal of the fd
check. Unnoticable for uncached journals, but maybe a little
runtime improvement when cached in specific circumstances.
window_matches_fd() has also been simplified to treat the
MMapFileDescrptor * as equivalent to its fd and prot.
Even with the new keyed hash table journal feature: if an attacker
manages to get access to the journal file id it could synthesize records
that result in hash collisions. Let's rotate automatically when we
notice that, so that a new journal file ID is generated, our performance
is restored and the attacker has to guess a new file ID before being
able to trigger the issue again.
That said, untrusted peers should never get access to journal files in
the first case...
This adds a new (incompatible) feature to journal files: if enabled the
hash function used for the hash tables is no longer jenkins hash with a
zero key, but siphash keyed by the file uuid that is included in the
file header anyway. This should make our hash tables more robust against
collision attacks, as long as the attacker has no read access to the
journal files. We switch from jenkins to siphash simply because it's
more well-known and we standardize for the rest of our codebase onto it.
This is hardening in order to make collision attacks harder for clients
that can forge log messages but have no read access to the logs. It has
no effect on clients that have read access.
Let's prefix this with "jenkins_" since it wraps the jenkins hash. We
want to add support for other hash functions to journald soon, hence
better be clear with what this is. In particular as all other symbols
defined by lookup3.h actually are prefixed "jenkins_".
Let's clean this up a bit, following our usual nomenclature to name
return parameters ret-xyz.
This is mostly a bit of renaming, but there's also some minor other
changes: if we return a pointer to a mmap'ed object plus its offset, in
almost all cases we are happy if either parameter is NULL in case the
caller is not interested in it. Let's fix the remaining case to do this
too, to minimize surprises.
Our journal code is generally supposed to be written in a fashion that
the underlying file can be deallocated any time, i.e. our mmap of it
suddenly becomes all zeroes. The idea is that we catch that when parsing
everything. For that to work safely we need to make sure that when doing
arithmetics or comparisons on values read from the map we don't run into
TTOCTTOU issues when determining validity. Hence we need to copy out the
values before use and operate on the copies. This requires some special
care since the C compiler could suppress our copies as optimization.
Hence use the new READ_NOW() macro to force a copy by using memcpy(),
and use it whenever we start doing an arithmetic operation on it, or
validity checking of multiple steps.
Fixes: #14943
https://bugzilla.redhat.com/show_bug.cgi?id=1715699
> /dev/mapper/live-rw 6.4G 5.7G 648M 91% /
> systemd-journald[905]: Fixed min_use=1.0M max_use=648.7M max_size=81.0M min_size=512.0K keep_free=973.1M n_max_files=100
When journald is started, we pick keep_free as 15% of the disk size. When the
fs is almost filled, we will only keep one journal file around and rotate very
often (because min_size is very small).
Let's set min use to something reasonable, so that we get more useful logs that
will cover at least the full boot.
Some cases considered in the PR:
> /dev/mapper/live-rw 6.4G 5.7G 648M 91% /
keep_free→MIN(327,100)→100 MB.
min_use→16MB.
effective range: 16 MB – 548 MB
> /dev/mapper/fedora_krowka-root 78G 69G 5.7G 93% /
keep_free → MIN(4GB, 100MB)→100MB
min_use→16MB
effective range: 16 MB – 5.6 GB
(but then there's the max_use limit, which cuts the range down)
> 4TB, 4GB free
keep_free → MIN(209715, 100) → 100 MB
min_use→16MB
effective range: 16 MB – 4.9 GB
(also effectively limited by max_use)
Also replace unneeded width suffixes with spaces, I think this is more
readable, and drop DEFAULT_ prefixes in cases where this setting is
simply a bound, and cannot be overridden by user config, hence is not
a default.
systemd-journal-remote always wrote the boot-id of the device it was running on
to the header of its journal files. When the source had a different boot-id
(because it was generated on a different boot, or a different device), the
boot-ids in the file were inconsistent. The _BOOT_ID field was that of the
source, but the journal file header and each entry object header were that of
the device systemd-journal-remote ran on. This breaks journalctl --list-boots
on any of these files.
Set the boot-id in the header to be that of the source. This also fixes the
entry object headers.
The thread launched in journal_file_set_offline() accesses a memory
mapped file, so it needs to handle SIGBUS. Leave SIGBUS unblocked on the
offlining thread so that it uses the same handler as the main thread.
The result of triggering SIGBUS in a thread where it's blocked is
undefined in Linux. The tested implementations were observed to cause
the default handler to run, taking down the whole journald process.
We can leave SIGBUS unblocked in multiple threads since it's handler is
thread-safe. If SIGBUS is sent to the journald process asynchronously
(i.e. with kill, sigqueue, or raise), either thread handling it will
result in the same behavior: it will install the default handler and
reraise the signal, killing the process.
Fixes: #12042
Continuation of a3ebe5eb620e49f0d24082876cafc7579261e64f:
in other places we sometimes use assert_se(), and sometimes normal error
handling. sigfillset and sigaddset can only fail if mask is NULL (which cannot
happen if we are passing in a reference), or if the signal number is invalid
(which really shouldn't happen when we are using a constant like SIGCHLD. If
SIGCHLD is invalid, we have a bigger problem). So let's simplify things and
always use assert_se() in those cases.
In sigset_add_many() we could conceivably pass an invalid signal, so let's keep
normal error handling here. The caller can do assert_se() around the
sigprocmask_many() call if appropriate.
'>= 0' is used for consistency with the rest of the codebase.
Ideally, coccinelle would strip unnecessary braces too. But I do not see any
option in coccinelle for this, so instead, I edited the patch text using
search&replace to remove the braces. Unfortunately this is not fully automatic,
in particular it didn't deal well with if-else-if-else blocks and ifdefs, so
there is an increased likelikehood be some bugs in such spots.
I also removed part of the patch that coccinelle generated for udev, where we
returns -1 for failure. This should be fixed independently.
Let's split out the part that actually renames the file in case we can't
open it into a new function journal_file_dispose().
This way we can reuse the function in other cases where we want to open
a file but can't.
Let's split the function in three: the part where we archive the old
file into journal_file_archive(), and the part where we initiate the
deferred closing into journal_file_initiate_close().
journal_file_rotate() then simply becomes a wrapper around these two
calls, and the opening of the new journal file.
This useful so that we can archive journal files without having to open
new ones, i.e. to do only the archival part of the rotation, without the
rotation part.
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.