docs: update kernel compatibility page, now recommending 5.0.4

* comprehensive list of kernels with bees-triggered corruption bug fixes
* deadlock between dedupe and rename is now fixed (in some places)
* compressed data corruption is now fixed (in more places)
* btrfs send fix for one bug is now merged in 5.2-rc1, another bug remains
* retired the bcache/lvmcache bug (can't reproduce those bugs any more, although I *can* reproduce an interesting non-destructive bcache bug)
* new minor bug entries for two harmless kernel warnings
* new entry for storm-of-soft-lockups

Fixes: https://github.com/Zygo/bees/issues/107
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2026-01-08 20:00:22 +01:00 · 2019-05-24 11:16:21 -04:00
parent 978c577412
commit e1476260e1
1 changed file with 131 additions and 47 deletions
--- a/docs/btrfs-kernel.md
+++ b/docs/btrfs-kernel.md
@@ -1,42 +1,81 @@
 Recommended kernel version
 ==========================

-Linux **4.14.34** or later.
+Currently 5.0.4, 5.1, and *chronologically* later versions are recommended
+to avoid all currently known and fixed kernel issues and obtain best
+performance.  Older kernel versions can be used with bees with some
+caveats (see below).

-A Brief List Of Btrfs Kernel Bugs
+All unmaintained kernel trees (those which do not receive -stable updates)
+should be avoided due to potential data corruption bugs.
+
+**Kernels older than 4.2 cannot run bees at all** due to missing features.
+
+DATA CORRUPTION WARNING
+-----------------------
+
+There is a data corruption bug in older Linux kernel versions that can
+be triggered by bees.  The bug can be triggered in other ways, but bees
+will trigger it especially often.
+
+This bug is **fixed** in the following kernel versions:
+
+* **5.1 or later** versions.
+
+* **5.0.4 or later 5.0.y** versions.
+
+* **4.19.31 or later 4.19.y** LTS versions.
+
+* **4.14.108 or later 4.14.y** LTS versions.
+
+* **4.9.165 or later 4.9.y** LTS versions.
+
+* **4.4.177 or later 4.4.y** LTS versions.
+
+* **3.18.137 or later 3.18.y** LTS versions (note these versions cannot
+run bees).
+
+All older kernel versions (including 4.20.17, 4.18.20, 4.17.19, 4.16.18,
+4.15.18) have the data corruption bug.
+
+The commit that fixes the last known data corruption bug is
+8e928218780e2f1cf2f5891c7575e8f0b284fcce "btrfs: fix corruption reading
+shared and compressed extents after hole punching".
+
+
+Lockup/hang WARNING
+-------------------
+
+Kernel versions prior to 5.0.4 have a deadlock bug when file A is
+renamed to replace B while both files A and B are referenced in a
+dedupe operation.  This situation may arise often while bees is running,
+which will make processes accessing the filesystem hang while writing.
+A reboot is required to recover.  No data is lost when this occurs
+(other than unflushed writes due to the reboot).
+
+A common problem case is rsync receiving updates to large files when not
+in `--inplace` mode.  If the file is sufficiently large, bees will start
+to dedupe the original file and rsync's temporary modified version of
+the file while rsync is still writing the modified version of the file.
+Later, when rsync renames the modified temporary file over the original
+file, the rename in rsync can occasionally deadlock with the dedupe
+in bees.
+
+This bug is **fixed** in the following kernel versions:
+
+* **5.1 or later** versions.
+
+* **5.0.4 or later 5.0.y** versions.
+
+The commit that fixes this bug is 4ea748e1d2c9f8a27332b949e8210dbbf392987e
+"btrfs: fix deadlock between clone/dedupe and rename".
+
+
+
+A Brief List Of btrfs Kernel Bugs
 ---------------------------------

-Recent kernel bug fixes:
-
-* 4.14.29: `WARN_ON(ref->count < 0)` in fs/btrfs/backref.c triggers
-  almost once per second.  The `WARN_ON` is incorrect, and is now removed.
-
-Unfixed kernel bugs (as of 4.14.71):
-
-* **Bad _filesystem destroying_ interactions** with other Linux block
-  layers:  `bcache` and `lvmcache` can fail spectacularly, and apparently
-  only do so while running bees.  This is definitely a kernel bug,
-  either in btrfs or the lower block layers.  **Avoid using bees with
-  these tools unless your filesystem is disposable and you intend to
-  debug the kernel.**
-
-* **Compressed data corruption** is possible when using the `fallocate`
-  system call to punch holes into compressed extents that contain long
-  runs of zeros.  The [bug results in intermittent corruption during
-  reads](https://www.spinics.net/lists/linux-btrfs/msg81293.html), but
-  due to the bug, the kernel might sometimes mistakenly determine data
-  is duplicate, and deduplication will corrupt the data permanently.
-  This bug also affects compressed `kvm` raw images with the `discard`
-  feature on btrfs or any compressed file where `fallocate -d` or
-  `fallocate -p` has been used.
-
-* **Deadlock** when [simultaneously using the same files in dedupe and
-  `rename`](https://www.spinics.net/lists/linux-btrfs/msg81109.html).
-  There is no way for bees to reliably know when another process is
-  about to rename a file while bees is deduping it.  In the `rsync` case,
-  bees will dedupe the new file `rsync` is creating using the old file
-  `rsync` is copying from, while `rsync` will rename the new file over
-  the old file to replace it.
+Unfixed kernel bugs (as of 5.0.21):

 Minor kernel problems with workarounds:

@@ -47,30 +86,75 @@ Minor kernel problems with workarounds:
  the kernel spends performing `LOGICAL_INO` operations and permanently
  blacklisting any extent or hash involved where the kernel starts
  to get slow.  In the bees log, such blocks are labelled as 'toxic'
-  hash/block addresses.
+  hash/block addresses.  Toxic extents are rare (about 1 in 100,000
+  extents become toxic), but toxic extents can become 8 orders of
+  magnitude more expensive to process than the fastest non-toxic
+  extents.  This seems to affect all dedupe agents on btrfs; at the
+  time of writing only bees has a workaround for this bug.

-* **btrfs send** has various bugs that are triggered when bees is
+* **btrfs send** has bugs that are triggered when bees is
  deduping snapshots.  bees provides the [`--workaround-btrfs-send`
  option](options.md) which should be used whenever `btrfs send` and
  bees are run on the same filesystem.

-  This issue affects:
-   * `btrfs send` (any mode) and bees active at the same time.
-   * `btrfs send` in incremental mode (using `-p` option) with bees
-     active at the same or different times.
+  Note `btrfs receive` is not affected, nor is any other btrfs operation
+  except `send`.  It is OK to run bees with no workarounds on a filesystem
+  that receives btrfs snapshots.

-  Note `btrfs receive` is not affected.  It is OK to run bees with no
-  workarounds on a filesystem that receives btrfs snapshots.
+  A fix for one problem has been [merged into kernel
+  5.2-rc1](https://github.com/torvalds/linux/commit/62d54f3a7fa27ef6a74d6cdf643ce04beba3afa7).
+  bees has not been updated to handle the new EAGAIN case optimally,
+  but the excess error messages that are produced are harmless.
+
+  The other problem is that [parent snapshots for incremental sends
+  are broken by bees](https://github.com/Zygo/bees/issues/115), even
+  when the snapshots are deduped while send is not running.
+
+* **btrfs send** also seems to have severe performance issues with
+  dedupe agents that produce toxic extents.  bees has a workaround to
+  prevent this where possible.

 * **Systems with many CPU cores** may [lock up when bees runs with one
  worker thread for every core](https://github.com/Zygo/bees/issues/91).
  bees limits the number of threads it will try to create based on
  detected CPU core count.  Users may override this limit with the
-  [`--thread-count` option](options.md).
+  [`--thread-count` option](options.md).  It is possible this is the
+  same bug as the next one:

-Older kernels:
+* **Storm of Soft Lockups**, a bug that occurs when running the
+  `LOGICAL_INO` ioctl in a large number of threads, leads to a soft lockup
+  on all CPUs.  Some details and analysis are available on [the btrfs
+  mailing list](https://www.spinics.net/lists/linux-btrfs/msg89326.html).
+  This occurs after hitting a BUG_ON in `fs/btrfs/ctree.c`:

-* Older kernels have various data corruption and deadlock/hang issues
-  that are no longer listed here, and older kernels are missing important
-  features such as `LOGICAL_INO_V2`.  Using an older kernel is not
-  recommended.
+        switch (tm->op) {
+                case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
+                        BUG_ON(tm->slot < n);
+                        /* Fallthrough */
+
+  The rate of incidence of this bug seems to depend on the total number
+  of bees threads running on the system, although occasionally other
+  processes such as `rsync` or `btrfs balance` are involved.  A workaround
+  is to run only 1 bees thread, i.e.  [`--thread-count=1`](options.md).
+
+* **Spurious warnings in `fs/fs-writeback.c`** on kernel 4.15 and later
+  when the filesystem is mounted with `flushoncommit`.  These
+  seem to be harmless (there are other locks which prevent
+  concurrent umount of the filesystem), but the underlying
+  problems that trigger the `WARN_ON` are [not trivial to
+  fix](https://www.spinics.net/lists/linux-btrfs/msg87752.html).
+  Workarounds:
+
+  1. mount with `-o noflushoncommit`
+  2. patch kernel to remove warning in `fs/fs-writeback.c`.
+
+  Note that using kernels 4.14 and earlier is *not* a viable workaround
+  for this issue, because kernels 4.14 and earlier will eventually
+  deadlock when a filesystem is mounted with `-o flushoncommit` (a single
+  commit fixes one bug and introduces the other).
+
+* **Spurious kernel warnings in `fs/btrfs/delayed-ref.c`** on 5.0.x.
+  This also seems harmless, but there have been [no comments
+  since this issue was reported to the `linux-btrfs` mailing
+  list](https://www.spinics.net/lists/linux-btrfs/msg89061.html).
+  Workaround:  patch kernel to remove the warning.
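
The version lists in the two WARNING sections of the new page can be checked mechanically. The following is a minimal Python sketch, not part of bees or of this commit: the names `CORRUPTION_FIX`, `DEADLOCK_FIX`, `parse_release`, `has_fix`, and `can_run_bees` are hypothetical, and the tables are copied from the lists above. It assumes the release string reflects an upstream or -stable kernel; a distribution kernel may carry backported fixes (or lack them) regardless of its version number, so treat the result as a hint only.

```python
import re

# Minimum fixed patch level per stable series, copied from the page text.
CORRUPTION_FIX = {
    (5, 0): 4,
    (4, 19): 31,
    (4, 14): 108,
    (4, 9): 165,
    (4, 4): 177,
    (3, 18): 137,
}
DEADLOCK_FIX = {(5, 0): 4}

# Both fixes are present in 5.1 and every later series.
FIXED_FROM = (5, 1)

# The page states that kernels older than 4.2 cannot run bees at all.
MINIMUM_FOR_BEES = (4, 2)


def parse_release(release):
    """Extract (major, minor, patch) from a string such as '4.19.31-amd64'."""
    m = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)


def has_fix(release, table):
    """Return True if the release is listed as fixed in the given table."""
    major, minor, patch = parse_release(release)
    if (major, minor) >= FIXED_FROM:
        return True
    # Series missing from the table (e.g. 4.20.y) never received the fix,
    # which is why the page says "chronologically later", not "higher".
    minimum = table.get((major, minor))
    return minimum is not None and patch >= minimum


def can_run_bees(release):
    """Return True if the release is at least 4.2."""
    major, minor, _ = parse_release(release)
    return (major, minor) >= MINIMUM_FOR_BEES
```

To check the running kernel, pass `platform.release()` from the standard library, for example `has_fix(platform.release(), CORRUPTION_FIX)`. Looking up the series in a table, rather than comparing version numbers directly, is what makes 4.20.17 come out as unfixed even though it is numerically higher than 4.19.31.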