journal: in some cases we have to decompress the full lz4 field

lz4 has to decompress a whole "sequence" at a time. When the compressed
data is composed of a repeating pattern, the whole set of repeats has
to be decompressed, and the output buffer has to be big enough.

This is unfortunate, because the slowdown is potentially very big: we
are only interested in the field name, but we might have to decompress
the whole thing. The full cost is only paid when the whole entry is a
repeating pattern, though, which in practice shouldn't happen (apart
from tests and the like). Hopefully lz4 will be fixed to avoid this
problem, or it will grow a new function which we can use [1], so this
fix should be temporary.

[1] https://groups.google.com/d/msg/lz4c/_3kkz5N6n00/oTahzqErCgAJ
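
The fallback described above can be illustrated with a minimal, self-contained sketch. It does not use lz4: a toy run-length codec stands in for it, chosen only because it shares the relevant property that the decoder can emit nothing smaller than a whole run. All toy_* names and the (count, byte) input format are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy codec (an assumption, not lz4): the input is a list of
 * (count, byte) pairs, and the decoder can only emit whole runs.
 * Stops once at least "want" bytes were produced. Returns the number
 * of bytes written, or -1 if the next run does not fit in dst. */
static long toy_decode_partial(const uint8_t *src, size_t src_len,
                               uint8_t *dst, size_t dst_cap, size_t want) {
        size_t out = 0;

        for (size_t i = 0; i + 1 < src_len && out < want; i += 2) {
                size_t run = src[i];

                if (run > dst_cap - out)
                        return -1;
                memset(dst + out, src[i + 1], run);
                out += run;
        }
        return (long) out;
}

/* Slow path: grow the buffer to the full decoded size, decode it all. */
static int toy_decode_all(const uint8_t *src, size_t src_len,
                          uint8_t **buf, size_t *cap, size_t *size) {
        size_t total = 0;
        long r;

        for (size_t i = 0; i + 1 < src_len; i += 2)
                total += src[i];

        if (total > *cap) {
                uint8_t *p = realloc(*buf, total);
                if (!p)
                        return -ENOMEM;
                *buf = p;
                *cap = total;
        }

        r = toy_decode_partial(src, src_len, *buf, *cap, SIZE_MAX);
        if (r < 0)
                return -EBADMSG;
        *size = (size_t) r;
        return 0;
}

/* Returns 1 if the decoded data starts with prefix followed by extra,
 * 0 if it does not, and a negative errno-style value on failure. */
static int toy_startswith(const uint8_t *src, size_t src_len,
                          uint8_t **buf, size_t *cap,
                          const void *prefix, size_t prefix_len, uint8_t extra) {
        size_t size;
        long r;

        assert(src && buf && *buf && cap && prefix);

        /* Fast path: ask only for the prefix plus one byte. */
        r = toy_decode_partial(src, src_len, *buf, *cap, prefix_len + 1);
        if (r >= 0)
                size = (size_t) r;
        else {
                /* A single run was larger than the buffer: fall back
                 * to decoding the whole field. */
                int k = toy_decode_all(src, src_len, buf, cap, &size);
                if (k < 0)
                        return k;
        }

        if (size >= prefix_len + 1)
                return memcmp(*buf, prefix, prefix_len) == 0 &&
                       (*buf)[prefix_len] == extra;
        return 0;
}
```

With an 8-byte buffer, an input such as "FOO=" followed by a long run is answered on the fast path without growing the buffer, while an input consisting of one 200-byte run forces the fallback and a reallocation to the full size. That is the pathological case the message refers to.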
Zbigniew Jędrzejewski-Szmek 2015-12-11 09:10:33 -05:00
parent 2aaec9b4f6
commit 1f4b467daa

@@ -306,6 +306,7 @@ int decompress_startswith_lz4(const void *src, uint64_t src_size,
          * prefix */

         int r;
+        size_t size;

         assert(src);
         assert(src_size > 0);
@@ -322,10 +323,18 @@ int decompress_startswith_lz4(const void *src, uint64_t src_size,

         r = LZ4_decompress_safe_partial(src + 8, *buffer, src_size - 8,
                                         prefix_len + 1, *buffer_size);
+        if (r >= 0)
+                size = (unsigned) r;
+        else {
+                /* lz4 always tries to decode full "sequence", so in
+                 * pathological cases might need to decompress the
+                 * full field. */
+                r = decompress_blob_lz4(src, src_size, buffer, buffer_size, &size, 0);
+                if (r < 0)
+                        return r;
+        }

-        if (r < 0)
-                return -EBADMSG;
-        if ((unsigned) r >= prefix_len + 1)
+        if (size >= prefix_len + 1)
                 return memcmp(*buffer, prefix, prefix_len) == 0 &&
                         ((const uint8_t*) *buffer)[prefix_len] == extra;
         else