From 19859b0a0dbb9eecda0a777fb2b2041d1f7f9704 Mon Sep 17 00:00:00 2001
From: Zygo Blaxell
Date: Tue, 6 Nov 2018 00:56:06 -0500
Subject: [PATCH] docs: toxic extents and btrfs send

Update documentation of toxic extent / slow backref workaround.

Add notes about btrfs send kernel bugs and incremental send failures.

Signed-off-by: Zygo Blaxell
---
 docs/btrfs-kernel.md | 30 +++++++++++++++++-------------
 docs/btrfs-other.md  |  3 ++-
 docs/gotchas.md      | 22 ++++++++--------------
 3 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/docs/btrfs-kernel.md b/docs/btrfs-kernel.md
index f111a85..2396604 100644
--- a/docs/btrfs-kernel.md
+++ b/docs/btrfs-kernel.md
@@ -38,19 +38,23 @@ Unfixed kernel bugs (as of 4.14.71):
   `rsync` is copying from, while `rsync` will rename the new file over
   the old file to replace it.
 
+* **btrfs send** has various problems when bees is deduping RO snapshots,
+  especially if the snapshot is used as a parent for incremental send.
+
 Minor kernel problems with workarounds:
 
-* **Slow backrefs** (aka toxic extents): If the number of references to a
-  single shared extent within a single file grows above a few thousand,
-  the kernel consumes CPU for minutes at a time while holding various
-  locks that block access to the filesystem. bees avoids this bug
-  by measuring the time the kernel spends performing `LOGICAL_INO`
-  operations and permanently blacklisting any extent or hash involved
-  where the kernel starts to get slow. Inside bees, such blocks are
-  known as 'toxic' hash/block addresses.
+* **Slow backrefs** (aka toxic extents): Under certain conditions,
+  if the number of references to a single shared extent grows too high,
+  the kernel consumes more and more CPU while holding locks that block
+  access to the filesystem. bees avoids this bug by measuring the time
+  the kernel spends performing `LOGICAL_INO` operations and permanently
+  blacklisting any extent or hash involved where the kernel starts
+  to get slow.  In the bees log, such blocks are labelled as 'toxic'
+  hash/block addresses.
 
-* **`FILE_EXTENT_SAME` is arbitrarily limited to 16MB**. This is
-  less than 128MB which is the maximum extent size that can be created
-  by defrag, prealloc, or filesystems without the `compress-force`
-  mount option. bees avoids feedback loops this can generate while
-  attempting to replace extents over 16MB in length.
+Older kernels:
+
+* Older kernels have various data corruption and deadlock/hang issues
+  that are no longer listed here, and older kernels are missing important
+  features such as `LOGICAL_INO_V2`. Using an older kernel is not
+  recommended.
diff --git a/docs/btrfs-other.md b/docs/btrfs-other.md
index a89cb09..0ca961f 100644
--- a/docs/btrfs-other.md
+++ b/docs/btrfs-other.md
@@ -31,7 +31,8 @@ bees has been tested in combination with the following, and various problems are
 * btrfs send: some kernel versions have bugs in btrfs send that can be
   triggered by bees. The send can be restarted and will work if bees has
   finished processing the snapshot being sent. No data corruption
-  observed other than the truncated send.
+  observed other than the truncated send. Incremental send doesn't seem
+  to work with bees running on the sending side.
 * btrfs qgroups: very slow, sometimes hangs...and it's even worse when
   bees is running.
 * btrfs autodefrag mount option: hangs and high CPU usage problems
diff --git a/docs/gotchas.md b/docs/gotchas.md
index 2712ac7..1078916 100644
--- a/docs/gotchas.md
+++ b/docs/gotchas.md
@@ -78,22 +78,16 @@ Other Gotchas
 -------------
 
 * bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
-  measuring the time required to perform `LOGICAL_INO` operations. If an
-  extent requires over 10 seconds to perform a `LOGICAL_INO` then bees
-  blacklists the extent and avoids referencing it in future operations.
-  In most cases, fewer than 0.1% of extents in a filesystem must be
-  avoided this way.  This results in short write latency spikes of up
-  to and a little over 10 seconds as btrfs will not allow writes to the
+  measuring the time required to perform `LOGICAL_INO` operations.
+  If an extent requires over 0.1 kernel CPU seconds to perform a
+  `LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
+  referencing it in future operations. In most cases, fewer than 0.1%
+  of extents in a filesystem must be avoided this way. This results
+  in short write latency spikes as btrfs will not allow writes to the
   filesystem while `LOGICAL_INO` is running. Generally the CPU spends
   most of the runtime of the `LOGICAL_INO` ioctl running the kernel,
-  so on a single-core CPU the entire system can freeze up for a few
-  seconds at a time.
-
-* Load managers that send a `SIGSTOP` to the bees process to throttle
-  CPU usage may affect the `LOGICAL_INO` timing mechanism, causing extents
-  to be incorrectly labelled 'toxic'. This will cause a small reduction
-  of dedupe hit rate. Slow and heavily loaded disks can trigger the same
-  effect if `LOGICAL_INO` takes too long due to IO latency.
+  so on a single-core CPU the entire system can freeze up for a second
+  during operations on toxic extents.
 
 * If a process holds a directory FD open, the subvol containing the
   directory cannot be deleted (`btrfs sub del` will start the deletion