mirror of
https://github.com/Zygo/bees.git
synced 2025-05-17 21:35:45 +02:00
docs: update kernel compatibility page, now recommending 5.0.4
* comprehensive list of kernels with bees-triggered corruption bug fixes
* deadlock between dedupe and rename is now fixed (in some places)
* compressed data corruption is now fixed (in more places)
* btrfs send fix for one bug is now merged in 5.2-rc1, another bug remains
* retired the bcache/lvmcache bug (can't reproduce those bugs any more, although I *can* reproduce an interesting non-destructive bcache bug)
* new minor bug entries for two harmless kernel warnings
* new entry for storm-of-soft-lockups

Fixes: https://github.com/Zygo/bees/issues/107
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit is contained in:
parent
978c577412
commit
e1476260e1
@ -1,42 +1,81 @@
|
|||||||
Recommended kernel version
|
Recommended kernel version
|
||||||
==========================
|
==========================
|
||||||
|
|
||||||
Linux **4.14.34** or later.
|
Currently 5.0.4, 5.1, and *chronologically* later versions are recommended
|
||||||
|
to avoid all currently known and fixed kernel issues and obtain best
|
||||||
|
performance. Older kernel versions can be used with bees with some
|
||||||
|
caveats (see below).
|
||||||
|
|
||||||
A Brief List Of Btrfs Kernel Bugs
|
All unmaintained kernel trees (those which do not receive -stable updates)
|
||||||
|
should be avoided due to potential data corruption bugs.
|
||||||
|
|
||||||
|
**Kernels older than 4.2 cannot run bees at all** due to missing features.
|
||||||
|
|
||||||
|
DATA CORRUPTION WARNING
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
There is a data corruption bug in older Linux kernel versions that can
|
||||||
|
be triggered by bees. The bug can be triggered in other ways, but bees
|
||||||
|
will trigger it especially often.
|
||||||
|
|
||||||
|
This bug is **fixed** in the following kernel versions:
|
||||||
|
|
||||||
|
* **5.1 or later** versions.
|
||||||
|
|
||||||
|
* **5.0.4 or later 5.0.y** versions.
|
||||||
|
|
||||||
|
* **4.19.31 or later 4.19.y** LTS versions.
|
||||||
|
|
||||||
|
* **4.14.108 or later 4.14.y** LTS versions.
|
||||||
|
|
||||||
|
* **4.9.165 or later 4.9.y** LTS versions.
|
||||||
|
|
||||||
|
* **4.4.177 or later 4.4.y** LTS versions.
|
||||||
|
|
||||||
|
* **3.18.137 or later 3.18.y** LTS versions (note these versions cannot
|
||||||
|
run bees).
|
||||||
|
|
||||||
|
All older kernel versions (including 4.20.17, 4.18.20, 4.17.19, 4.16.18,
|
||||||
|
4.15.18) have the data corruption bug.
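
As a rough illustration, the list of fixed versions above can be turned into a version check. This is a minimal sketch, not part of bees: the names `FIXED_IN_SERIES`, `parse_release`, and `has_corruption_fix` are hypothetical, and the check only looks at the upstream version number, so it assumes the running kernel carries no vendor backports.

```python
import re

# Minimum fixed point release for each stable series, copied from the list above.
FIXED_IN_SERIES = {
    (5, 0): 4,
    (4, 19): 31,
    (4, 14): 108,
    (4, 9): 165,
    (4, 4): 177,
    (3, 18): 137,  # has the fix, but cannot run bees
}
# Every series from 5.1 onward contains the fix.
FIRST_FIXED_MAINLINE = (5, 1)


def parse_release(release):
    """Turn a release string such as '4.19.31-generic' into (4, 19, 31)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def has_corruption_fix(release):
    """Return True if the upstream version number is in the fixed list."""
    major, minor, patch = parse_release(release)
    if (major, minor) >= FIRST_FIXED_MAINLINE:
        return True
    minimum = FIXED_IN_SERIES.get((major, minor))
    # Series missing from the table (4.20.y, 4.18.y, ...) never got the fix.
    return minimum is not None and patch >= minimum
```

To check the running system, one could call `has_corruption_fix(platform.release())` after `import platform`. A distribution kernel may include the fix under an older version number, so a negative result is a prompt to check the vendor changelog rather than proof of the bug.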
|
||||||
|
|
||||||
|
The commit that fixes the last known data corruption bug is
|
||||||
|
8e928218780e2f1cf2f5891c7575e8f0b284fcce "btrfs: fix corruption reading
|
||||||
|
shared and compressed extents after hole punching".
|
||||||
|
|
||||||
|
|
||||||
|
Lockup/hang WARNING
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
Kernel versions prior to 5.0.4 have a deadlock bug when file A is
|
||||||
|
renamed to replace B while both files A and B are referenced in a
|
||||||
|
dedupe operation. This situation may arise often while bees is running,
|
||||||
|
which will make processes accessing the filesystem hang while writing.
|
||||||
|
A reboot is required to recover. No data is lost when this occurs
|
||||||
|
(other than unflushed writes due to the reboot).
|
||||||
|
|
||||||
|
A common problem case is rsync receiving updates to large files when not
|
||||||
|
in `--inplace` mode. If the file is sufficiently large, bees will start
|
||||||
|
to dedupe the original file and rsync's temporary modified version of
|
||||||
|
the file while rsync is still writing the modified version of the file.
|
||||||
|
Later, when rsync renames the modified temporary file over the original
|
||||||
|
file, the rename in rsync can occasionally deadlock with the dedupe
|
||||||
|
in bees.
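
The write-then-rename pattern described above is not specific to rsync; many programs update files this way. The following is a minimal sketch of the pattern only (the function name `replace_via_rename` is hypothetical, and the code does not itself reproduce the deadlock): the temporary file plays the role of file A and the original plays the role of file B.

```python
import os
import tempfile


def replace_via_rename(path, data):
    """Write data to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            # While this write is in progress, a dedupe agent can already
            # be working on both the temporary file and the original.
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # rename(2): the temporary file (A) replaces the original (B).
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

On affected kernels, the final `os.replace` call is the step that can deadlock if a dedupe operation references both files at that moment.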
|
||||||
|
|
||||||
|
This bug is **fixed** in the following kernel versions:
|
||||||
|
|
||||||
|
* **5.1 or later** versions.
|
||||||
|
|
||||||
|
* **5.0.4 or later 5.0.y** versions.
|
||||||
|
|
||||||
|
The commit that fixes this bug is 4ea748e1d2c9f8a27332b949e8210dbbf392987e
|
||||||
|
"btrfs: fix deadlock between clone/dedupe and rename".
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
A Brief List Of btrfs Kernel Bugs
|
||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
Recent kernel bug fixes:
|
Unfixed kernel bugs (as of 5.0.21):
|
||||||
|
|
||||||
* 4.14.29: `WARN_ON(ref->count < 0)` in fs/btrfs/backref.c triggers
|
|
||||||
almost once per second. The `WARN_ON` is incorrect, and is now removed.
|
|
||||||
|
|
||||||
Unfixed kernel bugs (as of 4.14.71):
|
|
||||||
|
|
||||||
* **Bad _filesystem destroying_ interactions** with other Linux block
|
|
||||||
layers: `bcache` and `lvmcache` can fail spectacularly, and apparently
|
|
||||||
only do so while running bees. This is definitely a kernel bug,
|
|
||||||
either in btrfs or the lower block layers. **Avoid using bees with
|
|
||||||
these tools unless your filesystem is disposable and you intend to
|
|
||||||
debug the kernel.**
|
|
||||||
|
|
||||||
* **Compressed data corruption** is possible when using the `fallocate`
|
|
||||||
system call to punch holes into compressed extents that contain long
|
|
||||||
runs of zeros. The [bug results in intermittent corruption during
|
|
||||||
reads](https://www.spinics.net/lists/linux-btrfs/msg81293.html), but
|
|
||||||
due to the bug, the kernel might sometimes mistakenly determine data
|
|
||||||
is duplicate, and deduplication will corrupt the data permanently.
|
|
||||||
This bug also affects compressed `kvm` raw images with the `discard`
|
|
||||||
feature on btrfs or any compressed file where `fallocate -d` or
|
|
||||||
`fallocate -p` has been used.
|
|
||||||
|
|
||||||
* **Deadlock** when [simultaneously using the same files in dedupe and
|
|
||||||
`rename`](https://www.spinics.net/lists/linux-btrfs/msg81109.html).
|
|
||||||
There is no way for bees to reliably know when another process is
|
|
||||||
about to rename a file while bees is deduping it. In the `rsync` case,
|
|
||||||
bees will dedupe the new file `rsync` is creating using the old file
|
|
||||||
`rsync` is copying from, while `rsync` will rename the new file over
|
|
||||||
the old file to replace it.
|
|
||||||
|
|
||||||
Minor kernel problems with workarounds:
|
Minor kernel problems with workarounds:
|
||||||
|
|
||||||
@ -47,30 +86,75 @@ Minor kernel problems with workarounds:
|
|||||||
the kernel spends performing `LOGICAL_INO` operations and permanently
|
the kernel spends performing `LOGICAL_INO` operations and permanently
|
||||||
blacklisting any extent or hash involved where the kernel starts
|
blacklisting any extent or hash involved where the kernel starts
|
||||||
to get slow. In the bees log, such blocks are labelled as 'toxic'
|
to get slow. In the bees log, such blocks are labelled as 'toxic'
|
||||||
hash/block addresses.
|
hash/block addresses. Toxic extents are rare (about 1 in 100,000
|
||||||
|
extents become toxic), but toxic extents can become 8 orders of
|
||||||
|
magnitude more expensive to process than the fastest non-toxic
|
||||||
|
extents. This seems to affect all dedupe agents on btrfs; at the
|
||||||
|
time of writing only bees has a workaround for this bug.
|
||||||
|
|
||||||
* **btrfs send** has various bugs that are triggered when bees is
|
* **btrfs send** has bugs that are triggered when bees is
|
||||||
deduping snapshots. bees provides the [`--workaround-btrfs-send`
|
deduping snapshots. bees provides the [`--workaround-btrfs-send`
|
||||||
option](options.md) which should be used whenever `btrfs send` and
|
option](options.md) which should be used whenever `btrfs send` and
|
||||||
bees are run on the same filesystem.
|
bees are run on the same filesystem.
|
||||||
|
|
||||||
This issue affects:
|
Note `btrfs receive` is not affected, nor is any other btrfs operation
|
||||||
* `btrfs send` (any mode) and bees active at the same time.
|
except `send`. It is OK to run bees with no workarounds on a filesystem
|
||||||
* `btrfs send` in incremental mode (using `-p` option) with bees
|
that receives btrfs snapshots.
|
||||||
active at the same or different times.
|
|
||||||
|
|
||||||
Note `btrfs receive` is not affected. It is OK to run bees with no
|
A fix for one problem has been [merged into kernel
|
||||||
workarounds on a filesystem that receives btrfs snapshots.
|
5.2-rc1](https://github.com/torvalds/linux/commit/62d54f3a7fa27ef6a74d6cdf643ce04beba3afa7).
|
||||||
|
bees has not been updated to handle the new EAGAIN case optimally,
|
||||||
|
but the excess error messages that are produced are harmless.
|
||||||
|
|
||||||
|
The other problem is that [parent snapshots for incremental sends
|
||||||
|
are broken by bees](https://github.com/Zygo/bees/issues/115), even
|
||||||
|
when the snapshots are deduped while send is not running.
|
||||||
|
|
||||||
|
* **btrfs send** also seems to have severe performance issues with
|
||||||
|
dedupe agents that produce toxic extents. bees has a workaround to
|
||||||
|
prevent this where possible.
|
||||||
|
|
||||||
* **Systems with many CPU cores** may [lock up when bees runs with one
|
* **Systems with many CPU cores** may [lock up when bees runs with one
|
||||||
worker thread for every core](https://github.com/Zygo/bees/issues/91).
|
worker thread for every core](https://github.com/Zygo/bees/issues/91).
|
||||||
bees limits the number of threads it will try to create based on
|
bees limits the number of threads it will try to create based on
|
||||||
detected CPU core count. Users may override this limit with the
|
detected CPU core count. Users may override this limit with the
|
||||||
[`--thread-count` option](options.md).
|
[`--thread-count` option](options.md). It is possible this is the
|
||||||
|
same bug as the next one:
|
||||||
|
|
||||||
Older kernels:
|
* **Storm of Soft Lockups**, a bug that occurs when running the
|
||||||
|
`LOGICAL_INO` ioctl in a large number of threads, leads to a soft lockup
|
||||||
|
on all CPUs. Some details and analysis are available on [the btrfs
|
||||||
|
mailing list](https://www.spinics.net/lists/linux-btrfs/msg89326.html).
|
||||||
|
This occurs after hitting a BUG_ON in `fs/btrfs/ctree.c`:
|
||||||
|
|
||||||
* Older kernels have various data corruption and deadlock/hang issues
|
switch (tm->op) {
|
||||||
that are no longer listed here, and older kernels are missing important
|
case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
|
||||||
features such as `LOGICAL_INO_V2`. Using an older kernel is not
|
BUG_ON(tm->slot < n);
|
||||||
recommended.
|
/* Fallthrough */
|
||||||
|
|
||||||
|
The rate of incidence of this bug seems to depend on the total number
|
||||||
|
of bees threads running on the system, although occasionally other
|
||||||
|
processes such as `rsync` or `btrfs balance` are involved. A workaround
|
||||||
|
is to run only 1 bees thread, i.e. [`--thread-count=1`](options.md).
|
||||||
|
|
||||||
|
* **Spurious warnings in `fs/fs-writeback.c`** on kernel 4.15 and later
|
||||||
|
when the filesystem is mounted with `flushoncommit`. These
|
||||||
|
seem to be harmless (there are other locks which prevent
|
||||||
|
concurrent umount of the filesystem), but the underlying
|
||||||
|
problems that trigger the `WARN_ON` are [not trivial to
|
||||||
|
fix](https://www.spinics.net/lists/linux-btrfs/msg87752.html).
|
||||||
|
Workarounds:
|
||||||
|
|
||||||
|
1. mount with `-o noflushoncommit`
|
||||||
|
2. patch kernel to remove warning in `fs/fs-writeback.c`.
|
||||||
|
|
||||||
|
Note that using kernels 4.14 and earlier is *not* a viable workaround
|
||||||
|
for this issue, because kernels 4.14 and earlier will eventually
|
||||||
|
deadlock when a filesystem is mounted with `-o flushoncommit` (a single
|
||||||
|
commit fixes one bug and introduces the other).
|
||||||
|
|
||||||
|
* **Spurious kernel warnings in `fs/btrfs/delayed-ref.c`** on 5.0.x.
|
||||||
|
This also seems harmless, but there have been [no comments
|
||||||
|
since this issue was reported to the `linux-btrfs` mailing
|
||||||
|
list](https://www.spinics.net/lists/linux-btrfs/msg89061.html).
|
||||||
|
Workaround: patch kernel to remove the warning.
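
To find out whether the `fs/fs-writeback.c` warning is likely to apply, it may help to list the btrfs filesystems currently mounted with `flushoncommit`. This is a minimal sketch assuming the standard `/proc/mounts` field layout (device, mount point, filesystem type, options); the function name `btrfs_flushoncommit_mounts` is hypothetical.

```python
def btrfs_flushoncommit_mounts(mounts_text):
    """Return mount points of btrfs filesystems mounted with flushoncommit."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        # Compare whole option names so 'noflushoncommit' does not match.
        if fs_type == "btrfs" and "flushoncommit" in options.split(","):
            found.append(mount_point)
    return found
```

The function takes the file contents as a string so it can be tried on sample data; on a live system the input would come from reading `/proc/mounts`.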
|
||||||
|