The journal verification functions would reject such an entry. It would probably
still display fine (because we prefer _SOURCE_REALTIME_TIMESTAMP= if present), but
it seems wrong to create an entry that would not pass verification.
The code to open journal files seems like the wrong place to enforce this. We
already check during boot and refuse to boot if machine-id is missing, no need
to enforce this here. In particular, it seems better to write logs from
journald even if they are not completely functional rather than refuse to
operate at all, and systemd-journal-remote also writes journal files and may
even be run on a system without systemd at all.
The docker image that oss-fuzz uses has an empty /etc/machine-id. Obviously
this is an error in the docker, but docker is fact of life, and it seems better
for systemd-journal-remote to work in such an incomplete environment.
In journal_file_set_online() the offline thread doesn't need to be
joined if it's been canceled before actually reaching the phase of
writing the offline state.
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
It might happen that we try to bisect through a chain of offset arrays in the
journal whose last element was just allocated but no item yet written
to. In that case that array will be all NUL, but it might still end up
in our array chain cache. If it does, we cannot use it for bisection,
since for bisection we need to know the value of the first entry in that
array, but if it's uninitialized it does not have a first value. Hence,
as a simple fix, in this unlikely case, simply ignore the chain cache.
This is supposed to fix the issue pointed out in #8432, but in a more
permissive way, as this case isn't strictly a badly formatted journal
but actually a valid state (though one within a very short time window),
and we should make the best of it, and handle it gracefully.
Background: in each journal file entries are linked up in large arrays
of offsets. In each array the entries are strictly ordered by the
offsets of the entries, which permits search by bisection. These arrays
are allocated with a fixed size and then filled up as entries are added
to the journal file. If an array is fully filled up, a new array
(double in size as the old one) is appended to the journal file, and
linked up. This means, the journal file will contain a series of chained
up arrays, each time doubling in size, and strictly ordered. When
looking for an entry we maintain a "chain cache", which allows us to
bypass traversing the chain in full if we look for entries close to each
other in a short time. With the fix above we make sure we don't
erroneously use a chain cache item that doesn't carry enough information
for this bisection to work.
Original issue identified (with patch) by @Kxuan.
Replaces: #8432
Previously the compression threshold was hardcoded to 512, which meant that
smaller values wouldn't be compressed. This left some storage savings on the
table, so instead, we make that number tunable.
The Linux kernel exposes the birth time now for files through statx()
hence make use of it where available. We keep the xattr logic in place
for this however, since only a subset of file systems on Linux currently
expose the birth time. NFS and tmpfs for example do not support it. OTOH
there are other file systems that do support the birth time but might
not support xattrs (smb…), hence make the best of the two, in particular
in order to deal with journal files copied between file system types and
to maintain compatibility with older file systems that are updated to
newer version of the file system.
Let's add a common implementation for regular file checks, that are
careful to return the right error code (EISDIR/EISLNK/EBADFD) when we
are encountering a wrong file node.
Previously, we'd refuse open journal files with suffixes that aren't
either .journal or .journal~. With this change we only care when we are
creating the journal file.
I looked over the sources to see whether we ever pass files discovered
by directory enumeration to journal_file_open() without first checking
the suffix (in which case the old check made sense), but I couldn't find
any. hence I am pretty sure removing this check is safe.
Fixes: #7972
This ensures that in all threads we fork off in the background in our
code we mask out all signals, so that our thread won't end up getting
signals delivered the main process should be getting.
We always set the signal mask before forking off the thread, so that the
thread has the right mask set from its earliest existance on.
clang warns about a few sites like this:
../src/journal/journal-file.c:1780:48: warning: taking address of packed member 'entry_offset' of class or structure 'DataObject' may result in an unaligned pointer value [-Waddress-of-packed-member]
&o->data.entry_offset,
^~~~~~~~~~~~~~~~~~~~
but DataObject.entry_offset will always be 8-byte aligned as long as
the DataObject structure is aligned. Similarly in other cases, the
field is always aligned. Let's just silence the warning to avoid noise.
gcc does not know -Waddress-of-packed-member, and would warn about an unknown
warning, so we need to conditionalize on __clang__.
The advantage is that is the name is mispellt, cpp will warn us.
$ git grep -Ee "conf.set\('(HAVE|ENABLE)_" -l|xargs sed -r -i "s/conf.set\('(HAVE|ENABLE)_/conf.set10('\1_/"
$ git grep -Ee '#ifn?def (HAVE|ENABLE)' -l|xargs sed -r -i 's/#ifdef (HAVE|ENABLE)/#if \1/; s/#ifndef (HAVE|ENABLE)/#if ! \1/;'
$ git grep -Ee 'if.*defined\(HAVE' -l|xargs sed -i -r 's/defined\((HAVE_[A-Z0-9_]*)\)/\1/g'
$ git grep -Ee 'if.*defined\(ENABLE' -l|xargs sed -i -r 's/defined\((ENABLE_[A-Z0-9_]*)\)/\1/g'
+ manual changes to meson.build
squash! build-sys: use #if Y instead of #ifdef Y everywhere
v2:
- fix incorrect setting of HAVE_LIBIDN2
Introduce journal_file_check_object(), which does lightweight object
sanity checks, and use it in journal_file_move_to_object(), so that we
will catch certain corrupted objects in the journal file.
This fixes#6447, where we had only partially written out OBJECT_ENTRY
(ObjectHeader written, but rest of object zero bytes), causing
"journalctl --list-boots" to fail.
$ builddir.vanilla/journalctl --list-boots -D bug6447/
Failed to determine boots: No data available
$ builddir.patched/journalctl --list-boots -D bug6447/
-52 22633da1c5374a728d6c215e2c301dc2 Mon 2017-07-10 05:29:21 EEST—Mon 2017-07-10 05:31:51 EEST
-51 2253aab9ea7e4a2598f2abda82939eff Mon 2017-07-10 05:32:22 EEST—Mon 2017-07-10 05:36:49 EEST
-50 ef0d85d35c74486fa4104f9d6391b6ba Mon 2017-07-10 05:40:33 EEST—Mon 2017-07-10 05:40:40 EEST
[...]
Note that journal_file_check_object() is similar to
journal_file_object_verify(). The most expensive checks are omitted, as
they would slow down every journal_file_move_to_object() call too much.
With this implementation, the added overhead is small, for example when
dumping some journal content to /dev/null
(built with -Dbuildtype=debugoptimized -Db_ndebug=true):
Performance counter stats for 'builddir.vanilla/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/':
12542,311634 task-clock:u (msec) # 1,000 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
80 100 page-faults:u # 0,006 M/sec
41 786 963 456 cycles:u # 3,332 GHz
105 453 864 770 instructions:u # 2,52 insn per cycle
24 342 227 334 branches:u # 1940,809 M/sec
105 709 217 branch-misses:u # 0,43% of all branches
12,545199291 seconds time elapsed
Performance counter stats for 'builddir.patched/journalctl -D 76f4d4c3406945f9a60d3ca8763aa754/':
12734,723233 task-clock:u (msec) # 1,000 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
80 693 page-faults:u # 0,006 M/sec
42 661 017 429 cycles:u # 3,350 GHz
107 696 985 865 instructions:u # 2,52 insn per cycle
24 950 526 745 branches:u # 1959,252 M/sec
101 762 806 branch-misses:u # 0,41% of all branches
12,737527327 seconds time elapsed
Fixes#6447.
journal_file_move_to_object() can skip the second
journal_file_move_to() call if the first one already mapped a
sufficiently large area.
Now that mmap_cache_get() returns the size of the mapped area
when asked, ask for the size and only perform the second call if
the required size exceeds the mapped size instead of the object
header size.
This results in a nice performance boost in my testing, even with
a corpus of many small logs burning much CPU time elsewhere:
Before:
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m16.330s
user 0m16.281s
sys 0m0.046s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m16.409s
user 0m16.358s
sys 0m0.048s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m16.625s
user 0m16.558s
sys 0m0.061s
After:
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m15.311s
user 0m15.257s
sys 0m0.046s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m15.201s
user 0m15.135s
sys 0m0.062s
# time ./journalctl -b -1 --no-pager > /dev/null
real 0m15.170s
user 0m15.113s
sys 0m0.053s
If requested, return the actual mapping size to the caller in
addition to the address.
journal_file_move_to_object() often performs two successive
mmap_cache_get() calls via journal_file_move_to(); one to get the
object header, then another to get the entire object when it's
larger than the header's size.
If mmap_cache_get() returned the actual mapping's size, it's
probable that the second mmap_cache_get() could be skipped when
the established mapping already encompassed the desired size.
This way we have a MMapFileDescriptor reference external to the cache,
and can supply the handle directly to mmap_cache_get(), eliminating
hashmap lookups entirely from the hot path.
When some error occurs during the initialization of JournalFile,
the JournalFile can be left without hash tables created. When later
trying to append an entry to that file, the assertion in
journal_file_link_data() fails, and journald crashes.
This patch fix this issue by checking *_hash_table_size in
journal_file_verify_header().
It is possible to overflow uint64_t while validating the header of
a journal file. To prevent this, the addition itself is checked to
be within the limits of UINT64_MAX first.
To keep this readable, I have introduced two stack variables which
hold the converted values during validation.
76ec966f0e changed the code from ESHUTDOWN to ERFKILL, but missed one
spot in bus-common-errors.c. Fix that.
The code in transaction.c was checking for ERFKILL, but I'm not sure if this
mismatch had any effect, i.e. if there were any code paths in which the wrong
code actually made difference.
Also add comments when ESHUTDOWN is used in the journal code, so it's easy to
distinguish those cases when grepping. Standarize on the same capitalization.
(There's also a bunch of uses in sd-bus.c, but that's clearly different.)
* logind: trivial simplification
free_and_strdup() handles NULL arg, so make use of that.
* boot: fix two typos
* pid1: rewrite check in ignore_proc() to not check condition twice
It's harmless, but it seems nicer to evaluate a condition just a single time.
* core/execute: reformat exec_context_named_iofds() for legibility
* core/execute.c: check asprintf return value in the usual fashion
This is unlikely to fail, but we cannot rely on asprintf return value
on failure, so let's just be correct here.
CID #1368227.
* core/timer: use (void)
CID #1368234.
* journal-file: check asprintf return value in the usual fashion
This is unlikely to fail, but we cannot rely on asprintf return value
on failure, so let's just be correct here.
CID #1368236.
* shared/cgroup-show: use (void)
CID #1368243.
* cryptsetup: do not return uninitialized value on error
CID #1368416.
gcc 7 adds -Wimplicit-fallthrough=3 to -Wextra. There are a few ways
we could deal with that. After we take into account the need to stay compatible
with older versions of the compiler (and other compilers), I don't think adding
__attribute__((fallthrough)), even as a macro, is worth the trouble. It sticks
out too much, a comment is just as good. But gcc has some very specific
requiremnts how the comment should look. Adjust it the specific form that it
likes. I don't think the extra stuff we had in those comments was adding much
value.
(Note: the documentation seems to be wrong, and seems to describe a different
pattern from the one that is actually used. I guess either the docs or the code
will have to change before gcc 7 is finalized.)
https://bugzilla.redhat.com/show_bug.cgi?id=1416201
$ journalctl -b
Journal file /var/log/journal/ad18f69b80264b52bb3b766240742383/system@0005467d92e23784-a6571c8b69d09124.journal~ uses an unsupported feature, ignoring file.
Use SYSTEMD_LOG_LEVEL=debug journalctl --file=/var/log/journal/ad18f69b80264b52bb3b766240742383/system@0005467d92e23784-a6571c8b69d09124.journal~ to see the details.
-- No entries --
$ journalctl --file=/var/log/journal/ad18f69b80264b52bb3b766240742383/system@0005467d92e23784-a6571c8b69d09124.journal~
Journal file /var/log/journal/ad18f69b80264b52bb3b766240742383/system@0005467d92e23784-a6571c8b69d09124.journal~ uses incompatible flag lz4-compressed disabled at compilation time.
Failed to open journal file /var/log/journal/ad18f69b80264b52bb3b766240742383/system@0005467d92e23784-a6571c8b69d09124.journal~: Protocol not supported
mmap cache statistics: 0 hit, 1 miss
Failed to open files: Protocol not supported
Never permit that we write to journal files that have newer timestamps than our
local wallclock has. If we'd accept that, then the entries in the file might
end up not being ordered strictly.
Let's refuse this with ETXTBSY, and then immediately rotate to use a new file,
so that each file remains strictly ordered also be wallclock internally.
When iterating through partially synced journal files we need to be prepared
for hitting with invalid entries (specifically: non-initialized). Instead of
generated an error and giving up, let's simply try to preceed with the next one
that is valid (and debug log about this).
This reworks the logic introduced with caeab8f626
to iteration in both directions, and tries to look for valid entries located
after the invalid one. It also extends the behaviour to both iterating through
the global entry array and per-data object entry arrays.
Fixes: #4088
Let's make dissecting of borked journal files more expressive: if we encounter
an object whose first 8 bytes are all zeroes, then let's assume the object was
simply never initialized, and say so.
Previously, this would be detected as "overly short object", which is true too
in a away, but it's a lot more helpful printing different debug options for the
case where the size is not initialized at all and where the size is initialized
to some bogus value.
No function behaviour change, only a different log messages for both cases.
Let's make it easier to figure out when we see an invalid journal file, why we
consider it invalid, and add some minimal debug logging for it.
This log output is normally not seen (after all, this all is library code),
unless debug logging is exlicitly turned on.
Since commit 5996c7c295 (v190 !), the
calculation of the HMAC is broken because the hash for a data object
including a field is done in the wrong order: the field object is
hashed before the data object is.
However during verification, the hash is done in the opposite order as
objects are scanned sequentially.
The only code path which makes a journal durable is via
journal_file_set_offline().
When we perform a rotate the journal's header->state is being set to
STATE_ARCHIVED prior to journal_file_set_offline() being called.
In journal_file_set_offline(), we short-circuit the entire offline when
f->header->state != STATE_ONLINE.
This all results in none of the journal_file_set_offline() fsync() calls
being reached when rotate archives a journal, so archived journals are
never explicitly made durable.
What we do now is instead of setting the f->header->state to
STATE_ARCHIVED directly in journal_file_rotate() prior to
journal_file_close(), we set an archive flag in f->archive for the
journal_file_set_offline() machinery to honor by committing
STATE_ARCHIVED instead of STATE_OFFLINE when set.
Prior to this, rotated journals were never getting fsync() explicitly
performed on them, since journal_file_set_offline() short-circuited.
Obviously this is undesirable, and depends entirely on the underlying
filesystem as to how much durability was achieved when simply closing
the file.
Note that this problem existed prior to the recent asynchronous fsync
changes, but those changes do facilitate our performing this durable
offline on rotate without blocking, regardless of the underlying
filesystem sync-on-close semantics.
Previously, when we used a bisection table for seeking through a corrupted
file, and the end of the bisection table was corrupted we'd most likely fail
the entire seek operation. Improve the situation: if we encounter invalid
entries in a bisection table, linearly go backwards until we find a working
entry again.
When we linearly iterate through a corrupted journal file, and we encounter a
read error, don't consider this fatal, but merely as EOF condition (and log
about it).
Early in journal_file_set_offline() f->header->state is tested to see if
it's != STATE_ONLINE, and since there's no need to do anything if the
journal isn't online, the function simply returned here.
Since moving part of the offlining process to a separate thread, there
are two problems here:
1. We can't simply check f->header->state, because if there is an
offline thread active it may modify f->header->state.
2. Even if the journal is deemed offline, the thread responsible may
still need joining, so a bare return may leak the thread's resources
like its stack.
To address #1, the helper journal_file_is_offlining() is called prior to
accessing f->header->state.
If journal_file_is_offlining() returns true, f->header->state isn't even
checked, because an offlining journal is obviously online, and we'll
just continue with the normal set offline code path.
If journal_file_is_offlining() returns false, then it's safe to check
f->header->state, because the offline_state is beyond the point of
modifying f->header->state, and there's a memory barrier in the helper.
If we find f->header->state is != STATE_ONLINE, then we call the
idempotent journal_file_set_offline_thread_join() on the way out of the
function, to join a potential lingering offline thread.
Show the various timestamps in hexadecimal too. This is useful for matching the
timestamps included in cursor strings (which are encoded in hex, too), with the
references in the journal header.
Also, expose this via the "journalctl --file=-" syntax for STDIN. This feature
remains undocumented though, as it is probably not too useful in real-life as
this still requires fds that support mmaping and seeking, i.e. does not work
for pipes, for which reading from STDIN is most commonly used.
Throughout the tree there's spurious use of spaces separating ++ and --
operators from their respective operands. Make ++ and -- operator
consistent with the majority of existing uses; discard the spaces.
When we rotate journals, we must set offline and close the current one,
but don't generally need to wait for this to complete.
Instead, we'll initiate an asynchronous offline via
journal_file_set_offline(oldfile, false), and add the file to a
per-server set of deferred closes to be closed later when they
won't block.
There's one complication however; journal_file_open() via
journal_file_verify_header() assumes that any writable journal in the
online state is the product of an unclean shutdown or other form of
corruption.
Thus there's a need for journal_file_open() to be aware of deferred
closes and synchronize with their completion when opening preexisting
journals for writing. To facilitate this the deferred closes set is
supplied to the journal_file_open() function where the deferred closes
may be closed synchronously before verifying the header in such
circumstances.
This adds a wait flag to journal_file_set_offline(), when false the offline is
performed asynchronously in a separate thread.
When wait is true, if an asynchronous offline is already in-progress it is
restarted and waited for. Otherwise the offline is performed synchronously
without the use of a thread.
journal_file_set_online() cancels or waits for the asynchronous offline to
complete if in-flight, depending on where in the offline process the thread
happens to be. If the thread is in the fsync() phase, it is cancelled and
waiting is unnecessary. Otherwise, the thread is joined before proceeding.
A new offline_state member is added to JournalFile which is used via
atomic operations for communicating between the offline thread and the
journal_file_set_{offline,online}() functions.
ISO/IEC 9899:1999 §7.21.1/2 says:
Where an argument declared as size_t n specifies the length of the array
for a function, n can have the value zero on a call to that
function. Unless explicitly stated otherwise in the description of a
particular function in this subclause, pointer arguments on such a call
shall still have valid values, as described in 7.1.4.
In base64_append_width memcpy was called as memcpy(x, NULL, 0). GCC 4.9
started making use of this and assumes This worked fine under -O0, but
does something strange under -O3.
This patch fixes a bug in base64_append_width(), fixes a possible bug in
journal_file_append_entry_internal(), and makes use of the new function
to simplify the code in other places.
Remove the check that triggers rotation of the journal file when the arriving log entry had a monotonic timestamp smaller that the previous log entry. This check causes unnecessary rotations when journal-remote was receiving from multiple senders, therefore monotonicity can not be guaranteed. Also, it does not offer any useful functionality for systemd-journald.
This primarily contains some minor coding style fixups for 7a24f3bf2f and earlier changes. Specifically:
* Don't log at log levels above LOG_DEBUG from "library" code like journal-file.c
* Don't negate errno values before passing them to log_debug_errno(), as the call can handle this fine anyway
* Cast some calls we knowingly ignore the return values of to (void)
* Don't clobber function call-by-ref return values on failure
* Don't mix function calls and variable declarations in one line
There's also one more relevant change: when failing to enqueue a journal change fs event, we'll run it immediately.
compress_blob took src, src_size, dst and *dst_size, but dst_size
wasn't used as an input parameter with the size of dst, but only as an
output parameter. dst was implicitly assumed to be at least src_size-1.
This code wasn't *wrong*, because the only real caller in
journal-file.c got it right. But it was misleading, and the tests in
test-compress.c got it wrong, and worked only because the output
buffer happened to be the same size as input buffer. So add a seperate
dst_allocated_size parameter to make it explicit what the size of the
buffer is, and to allow test to proceed with different output buffer
sizes.
For low end embedded systems 4 MiB for each journal file is a lot of
memory. Journald will use at least 512 KiB even if JOURNAL_FILE_SIZE_MIN is
set to less than that so just use 512 KiB.
When reading stuff, we should only return EIO when an actual read error
occured, not when we don't like the data for whatever reason.
We already return ENODATA for all other kinds of file truncation, hence
do the same for the most obvious kind, so that callers know what ENODATA
means.