mirror of https://github.com/Zygo/bees.git synced 2025-05-17 21:35:45 +02:00

docs: toxic extents and btrfs send

Update documentation of toxic extent / slow backref workaround.

Add notes about btrfs send kernel bugs and incremental send failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Zygo Blaxell 2018-11-06 00:56:06 -05:00
parent 688d0dc014
commit 19859b0a0d
3 changed files with 27 additions and 28 deletions

@@ -38,19 +38,23 @@ Unfixed kernel bugs (as of 4.14.71):
`rsync` is copying from, while `rsync` will rename the new file over
the old file to replace it.
* **btrfs send** has various problems when bees is deduping RO snapshots,
especially if the snapshot is used as a parent for incremental send.
Minor kernel problems with workarounds:
* **Slow backrefs** (aka toxic extents): If the number of references to a
single shared extent within a single file grows above a few thousand,
the kernel consumes CPU for minutes at a time while holding various
locks that block access to the filesystem. bees avoids this bug
by measuring the time the kernel spends performing `LOGICAL_INO`
operations and permanently blacklisting any extent or hash involved
where the kernel starts to get slow. Inside bees, such blocks are
known as 'toxic' hash/block addresses.
* **Slow backrefs** (aka toxic extents): Under certain conditions,
if the number of references to a single shared extent grows too high,
the kernel consumes more and more CPU while holding locks that block
access to the filesystem. bees avoids this bug by measuring the time
the kernel spends performing `LOGICAL_INO` operations and permanently
blacklisting any extent or hash involved where the kernel starts
to get slow. In the bees log, such blocks are labelled as 'toxic'
hash/block addresses.
* **`FILE_EXTENT_SAME` is arbitrarily limited to 16MB**. This is
less than 128MB which is the maximum extent size that can be created
by defrag, prealloc, or filesystems without the `compress-force`
mount option. bees avoids feedback loops this can generate while
attempting to replace extents over 16MB in length.
Older kernels:
* Older kernels have various data corruption and deadlock/hang issues
that are no longer listed here, and older kernels are missing important
features such as `LOGICAL_INO_V2`. Using an older kernel is not
recommended.

@@ -31,7 +31,8 @@ bees has been tested in combination with the following, and various problems are
* btrfs send: some kernel versions have bugs in btrfs send that can be
triggered by bees. The send can be restarted and will work if bees
has finished processing the snapshot being sent. No data corruption
observed other than the truncated send.
observed other than the truncated send. Incremental send doesn't seem
to work with bees running on the sending side.
* btrfs qgroups: very slow, sometimes hangs...and it's even worse when
bees is running.
* btrfs autodefrag mount option: hangs and high CPU usage problems

@@ -78,22 +78,16 @@ Other Gotchas
-------------
* bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
measuring the time required to perform `LOGICAL_INO` operations. If an
extent requires over 10 seconds to perform a `LOGICAL_INO` then bees
blacklists the extent and avoids referencing it in future operations.
In most cases, fewer than 0.1% of extents in a filesystem must be
avoided this way. This results in short write latency spikes of up
to and a little over 10 seconds as btrfs will not allow writes to the
measuring the time required to perform `LOGICAL_INO` operations.
If an extent requires over 0.1 kernel CPU seconds to perform a
`LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
referencing it in future operations. In most cases, fewer than 0.1%
of extents in a filesystem must be avoided this way. This results
in short write latency spikes as btrfs will not allow writes to the
filesystem while `LOGICAL_INO` is running. Generally the CPU spends
most of the runtime of the `LOGICAL_INO` ioctl running the kernel,
so on a single-core CPU the entire system can freeze up for a few
seconds at a time.
* Load managers that send a `SIGSTOP` to the bees process to throttle
CPU usage may affect the `LOGICAL_INO` timing mechanism, causing extents
to be incorrectly labelled 'toxic'. This will cause a small reduction
of dedupe hit rate. Slow and heavily loaded disks can trigger the same
effect if `LOGICAL_INO` takes too long due to IO latency.
so on a single-core CPU the entire system can freeze up for a second
during operations on toxic extents.
* If a process holds a directory FD open, the subvol containing the
directory cannot be deleted (`btrfs sub del` will start the deletion