mirror of https://github.com/Zygo/bees.git synced 2025-05-17 21:35:45 +02:00

docs: update kernel compatibility page, now recommending 5.0.4

* comprehensive list of kernels with bees-triggered corruption bug fixes
 * deadlock between dedupe and rename is now fixed (in some places)
 * compressed data corruption is now fixed (in more places)
 * btrfs send fix for one bug is now merged in 5.2-rc1, another bug remains
 * retired the bcache/lvmcache bug (can't reproduce those bugs any more,
   although I *can* reproduce an interesting non-destructive bcache bug)
 * new minor bug entries for two harmless kernel warnings
 * new entry for storm-of-soft-lockups

Fixes: https://github.com/Zygo/bees/issues/107
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Zygo Blaxell 2019-05-24 11:16:21 -04:00
parent 978c577412
commit e1476260e1


@@ -1,42 +1,81 @@
Recommended kernel version
==========================
Linux **4.14.34** or later.
Currently 5.0.4, 5.1, and *chronologically* later versions are recommended
to avoid all currently known and fixed kernel issues and obtain best
performance. Older kernel versions can be used with bees with some
caveats (see below).
A Brief List Of Btrfs Kernel Bugs
All unmaintained kernel trees (those which do not receive -stable updates)
should be avoided due to potential data corruption bugs.
**Kernels older than 4.2 cannot run bees at all** due to missing features.
DATA CORRUPTION WARNING
-----------------------
There is a data corruption bug in older Linux kernel versions that can
be triggered by bees. The bug can be triggered in other ways, but bees
will trigger it especially often.
This bug is **fixed** in the following kernel versions:
* **5.1 or later** versions.
* **5.0.4 or later 5.0.y** versions.
* **4.19.31 or later 4.19.y** LTS versions.
* **4.14.108 or later 4.14.y** LTS versions.
* **4.9.165 or later 4.9.y** LTS versions.
* **4.4.177 or later 4.4.y** LTS versions.
* **3.18.137 or later 3.18.y** LTS versions (note these versions cannot
run bees).
All older kernel versions (including 4.20.17, 4.18.20, 4.17.19, 4.16.18,
4.15.18) have the data corruption bug.
The commit that fixes the last known data corruption bug is
8e928218780e2f1cf2f5891c7575e8f0b284fcce "btrfs: fix corruption reading
shared and compressed extents after hole punching".
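
The list of fixed versions above can be turned into a simple check. The following is a minimal sketch, not part of bees: the names `FIXED_PATCHLEVEL`, `parse_release`, and `has_corruption_fix` are hypothetical, and the check assumes a mainline or -stable kernel whose release string starts with `X.Y` or `X.Y.Z`. Distribution kernels that backport fixes, and release candidates, are not handled.

```python
import re

# Minimum fixed patch level for each stable series, copied from the list above.
# Series not listed here and older than 5.1 (e.g. 4.20.y, 4.18.y) have the bug.
FIXED_PATCHLEVEL = {
    (5, 0): 4,
    (4, 19): 31,
    (4, 14): 108,
    (4, 9): 165,
    (4, 4): 177,
    (3, 18): 137,  # fixed, but 3.18.y cannot run bees at all
}


def parse_release(release):
    """Parse 'X.Y[.Z][-suffix]' into (X, Y, Z); a missing Z is treated as 0."""
    m = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)


def has_corruption_fix(release):
    """Return True if the release is at or beyond a version listed as fixed."""
    major, minor, patch = parse_release(release)
    if (major, minor) >= (5, 1):
        return True
    needed = FIXED_PATCHLEVEL.get((major, minor))
    return needed is not None and patch >= needed
```

On a running system the function could be called as `has_corruption_fix(platform.release())` after `import platform`. A `True` result only says the version number is in the fixed range; it does not prove the fix is present in a vendor-patched kernel.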
Lockup/hang WARNING
-------------------
Kernel versions prior to 5.0.4 have a deadlock bug when file A is
renamed to replace B while both files A and B are referenced in a
dedupe operation. This situation may arise often while bees is running,
which will make processes accessing the filesystem hang while writing.
A reboot is required to recover. No data is lost when this occurs
(other than unflushed writes due to the reboot).
A common problem case is rsync receiving updates to large files when not
in `--inplace` mode. If the file is sufficiently large, bees will start
to dedupe the original file and rsync's temporary modified version of
the file while rsync is still writing the modified version of the file.
Later, when rsync renames the modified temporary file over the original
file, the rename in rsync can occasionally deadlock with the dedupe
in bees.
This bug is **fixed** in the following kernel versions:
* **5.1 or later** versions.
* **5.0.4 or later 5.0.y** versions.
The commit that fixes this bug is 4ea748e1d2c9f8a27332b949e8210dbbf392987e
"btrfs: fix deadlock between clone/dedupe and rename".
A Brief List Of btrfs Kernel Bugs
---------------------------------
Recent kernel bug fixes:
* 4.14.29: `WARN_ON(ref->count < 0)` in fs/btrfs/backref.c triggers
almost once per second. The `WARN_ON` is incorrect, and is now removed.
Unfixed kernel bugs (as of 4.14.71):
* **Bad _filesystem destroying_ interactions** with other Linux block
layers: `bcache` and `lvmcache` can fail spectacularly, and apparently
only do so while running bees. This is definitely a kernel bug,
either in btrfs or the lower block layers. **Avoid using bees with
these tools unless your filesystem is disposable and you intend to
debug the kernel.**
* **Compressed data corruption** is possible when using the `fallocate`
system call to punch holes into compressed extents that contain long
runs of zeros. The [bug results in intermittent corruption during
reads](https://www.spinics.net/lists/linux-btrfs/msg81293.html), but
due to the bug, the kernel might sometimes mistakenly determine data
is duplicate, and deduplication will corrupt the data permanently.
This bug also affects compressed `kvm` raw images with the `discard`
feature on btrfs or any compressed file where `fallocate -d` or
`fallocate -p` has been used.
* **Deadlock** when [simultaneously using the same files in dedupe and
`rename`](https://www.spinics.net/lists/linux-btrfs/msg81109.html).
There is no way for bees to reliably know when another process is
about to rename a file while bees is deduping it. In the `rsync` case,
bees will dedupe the new file `rsync` is creating using the old file
`rsync` is copying from, while `rsync` will rename the new file over
the old file to replace it.
Unfixed kernel bugs (as of 5.0.21):
Minor kernel problems with workarounds:
@@ -47,30 +86,75 @@ Minor kernel problems with workarounds:
the kernel spends performing `LOGICAL_INO` operations and permanently
blacklisting any extent or hash involved where the kernel starts
to get slow. In the bees log, such blocks are labelled as 'toxic'
hash/block addresses.
hash/block addresses. Toxic extents are rare (about 1 in 100,000
extents become toxic), but toxic extents can become 8 orders of
magnitude more expensive to process than the fastest non-toxic
extents. This seems to affect all dedupe agents on btrfs; at the
time of writing only bees has a workaround for this bug.
* **btrfs send** has various bugs that are triggered when bees is
* **btrfs send** has bugs that are triggered when bees is
deduping snapshots. bees provides the [`--workaround-btrfs-send`
option](options.md) which should be used whenever `btrfs send` and
bees are run on the same filesystem.
This issue affects:
* `btrfs send` (any mode) and bees active at the same time.
* `btrfs send` in incremental mode (using `-p` option) with bees
active at the same or different times.
Note `btrfs receive` is not affected, nor is any other btrfs operation
except `send`. It is OK to run bees with no workarounds on a filesystem
that receives btrfs snapshots.
Note `btrfs receive` is not affected. It is OK to run bees with no
workarounds on a filesystem that receives btrfs snapshots.
A fix for one problem has been [merged into kernel
5.2-rc1](https://github.com/torvalds/linux/commit/62d54f3a7fa27ef6a74d6cdf643ce04beba3afa7).
bees has not been updated to handle the new EAGAIN case optimally,
but the excess error messages that are produced are harmless.
The other problem is that [parent snapshots for incremental sends
are broken by bees](https://github.com/Zygo/bees/issues/115), even
when the snapshots are deduped while send is not running.
* **btrfs send** also seems to have severe performance issues with
dedupe agents that produce toxic extents. bees has a workaround to
prevent this where possible.
* **Systems with many CPU cores** may [lock up when bees runs with one
worker thread for every core](https://github.com/Zygo/bees/issues/91).
bees limits the number of threads it will try to create based on
detected CPU core count. Users may override this limit with the
[`--thread-count` option](options.md).
[`--thread-count` option](options.md). It is possible this is the
same bug as the next one:
Older kernels:
* **Storm of Soft Lockups**, a bug that occurs when running the
`LOGICAL_INO` ioctl in a large number of threads, leads to a soft lockup
on all CPUs. Some details and analysis are available on [the btrfs
mailing list](https://www.spinics.net/lists/linux-btrfs/msg89326.html).
This occurs after hitting a `BUG_ON` in `fs/btrfs/ctree.c`:
* Older kernels have various data corruption and deadlock/hang issues
that are no longer listed here, and older kernels are missing important
features such as `LOGICAL_INO_V2`. Using an older kernel is not
recommended.
        switch (tm->op) {
        case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                BUG_ON(tm->slot < n);
                /* Fallthrough */
The rate of incidence of this bug seems to depend on the total number
of bees threads running on the system, although occasionally other
processes such as `rsync` or `btrfs balance` are involved. A workaround
is to run only 1 bees thread, i.e. [`--thread-count=1`](options.md).
* **Spurious warnings in `fs/fs-writeback.c`** on kernel 4.15 and later
when the filesystem is mounted with `flushoncommit`. These
seem to be harmless (there are other locks which prevent
concurrent umount of the filesystem), but the underlying
problems that trigger the `WARN_ON` are [not trivial to
fix](https://www.spinics.net/lists/linux-btrfs/msg87752.html).
Workarounds:
1. mount with `-o noflushoncommit`
2. patch kernel to remove warning in `fs/fs-writeback.c`.
Note that using kernels 4.14 and earlier is *not* a viable workaround
for this issue, because kernels 4.14 and earlier will eventually
deadlock when a filesystem is mounted with `-o flushoncommit` (a single
commit fixes one bug and introduces the other).
* **Spurious kernel warnings in `fs/btrfs/delayed-ref.c`** on 5.0.x.
This also seems harmless, but there have been [no comments
since this issue was reported to the `linux-btrfs` mailing
list](https://www.spinics.net/lists/linux-btrfs/msg89061.html).
Workaround: patch kernel to remove the warning.
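
The toxic-extent workaround described in the list above (timing `LOGICAL_INO` operations and permanently blacklisting anything that makes the kernel slow) can be illustrated with a small sketch. This is not the bees implementation: `ToxicTracker`, its `resolve` method, and the threshold value are hypothetical, and `lookup` stands in for whatever performs the real `LOGICAL_INO` call.

```python
import time

# Hypothetical cutoff for illustration only; not a value taken from bees.
TOXIC_THRESHOLD_SECONDS = 0.1


class ToxicTracker:
    """Time each lookup and permanently skip addresses that were too slow."""

    def __init__(self, threshold=TOXIC_THRESHOLD_SECONDS, clock=time.monotonic):
        self.threshold = threshold
        self.clock = clock
        self.toxic = set()  # addresses blacklisted for the tracker's lifetime

    def resolve(self, address, lookup):
        """Run lookup(address) unless the address is already marked toxic."""
        if address in self.toxic:
            return None  # skipped: never pay the slow path a second time
        start = self.clock()
        result = lookup(address)
        if self.clock() - start > self.threshold:
            self.toxic.add(address)
        return result
```

The first slow lookup still completes and its result is returned; only later requests for the same address are skipped. Because toxic extents are rare, that trades a small amount of missed dedupe for bounded processing time.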