mirror of https://github.com/Zygo/bees.git synced 2025-05-18 05:45:45 +02:00

docs: toxic extents and btrfs send

Update documentation of toxic extent / slow backref workaround.

Add notes about btrfs send kernel bugs and incremental send failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Zygo Blaxell 2018-11-06 00:56:06 -05:00
parent 688d0dc014
commit 19859b0a0d
3 changed files with 27 additions and 28 deletions

@@ -38,19 +38,23 @@ Unfixed kernel bugs (as of 4.14.71):
 `rsync` is copying from, while `rsync` will rename the new file over
 the old file to replace it.

+* **btrfs send** has various problems when bees is deduping RO snapshots,
+especially if the snapshot is used as a parent for incremental send.
+
 Minor kernel problems with workarounds:

-* **Slow backrefs** (aka toxic extents): If the number of references to a
-single shared extent within a single file grows above a few thousand,
-the kernel consumes CPU for minutes at a time while holding various
-locks that block access to the filesystem. bees avoids this bug
-by measuring the time the kernel spends performing `LOGICAL_INO`
-operations and permanently blacklisting any extent or hash involved
-where the kernel starts to get slow. Inside bees, such blocks are
-known as 'toxic' hash/block addresses.
+* **Slow backrefs** (aka toxic extents): Under certain conditions,
+if the number of references to a single shared extent grows too high,
+the kernel consumes more and more CPU while holding locks that block
+access to the filesystem. bees avoids this bug by measuring the time
+the kernel spends performing `LOGICAL_INO` operations and permanently
+blacklisting any extent or hash involved where the kernel starts
+to get slow. In the bees log, such blocks are labelled as 'toxic'
+hash/block addresses.

-* **`FILE_EXTENT_SAME` is arbitrarily limited to 16MB**. This is
-less than 128MB which is the maximum extent size that can be created
-by defrag, prealloc, or filesystems without the `compress-force`
-mount option. bees avoids feedback loops this can generate while
-attempting to replace extents over 16MB in length.
+Older kernels:
+
+* Older kernels have various data corruption and deadlock/hang issues
+that are no longer listed here, and older kernels are missing important
+features such as `LOGICAL_INO_V2`. Using an older kernel is not
+recommended.

@@ -31,7 +31,8 @@ bees has been tested in combination with the following, and various problems are
 * btrfs send: some kernel versions have bugs in btrfs send that can be
 triggered by bees. The send can be restarted and will work if bees
 has finished processing the snapshot being sent. No data corruption
-observed other than the truncated send.
+observed other than the truncated send. Incremental send doesn't seem
+to work with bees running on the sending side.
 * btrfs qgroups: very slow, sometimes hangs...and it's even worse when
 bees is running.
 * btrfs autodefrag mount option: hangs and high CPU usage problems

@@ -78,22 +78,16 @@ Other Gotchas
 -------------

 * bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
-measuring the time required to perform `LOGICAL_INO` operations. If an
-extent requires over 10 seconds to perform a `LOGICAL_INO` then bees
-blacklists the extent and avoids referencing it in future operations.
-In most cases, fewer than 0.1% of extents in a filesystem must be
-avoided this way. This results in short write latency spikes of up
-to and a little over 10 seconds as btrfs will not allow writes to the
+measuring the time required to perform `LOGICAL_INO` operations.
+If an extent requires over 0.1 kernel CPU seconds to perform a
+`LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
+referencing it in future operations. In most cases, fewer than 0.1%
+of extents in a filesystem must be avoided this way. This results
+in short write latency spikes as btrfs will not allow writes to the
 filesystem while `LOGICAL_INO` is running. Generally the CPU spends
 most of the runtime of the `LOGICAL_INO` ioctl running the kernel,
-so on a single-core CPU the entire system can freeze up for a few
-seconds at a time.
+so on a single-core CPU the entire system can freeze up for a second
+during operations on toxic extents.

-* Load managers that send a `SIGSTOP` to the bees process to throttle
-CPU usage may affect the `LOGICAL_INO` timing mechanism, causing extents
-to be incorrectly labelled 'toxic'. This will cause a small reduction
-of dedupe hit rate. Slow and heavily loaded disks can trigger the same
-effect if `LOGICAL_INO` takes too long due to IO latency.
-
 * If a process holds a directory FD open, the subvol containing the
 directory cannot be deleted (`btrfs sub del` will start the deletion
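
The toxic-extent workaround described in this commit's updated text (time each `LOGICAL_INO` lookup in kernel CPU seconds, then permanently avoid any extent that exceeds the threshold) can be illustrated with a rough Python sketch. This is not bees code: `ToxicExtentTracker`, `kernel_cpu_seconds`, and the `resolve` callback are hypothetical names, and `os.times().system` reports process-wide rather than per-thread kernel CPU time, which may differ from how bees actually measures it. Only the 0.1 second threshold is taken from the documentation text.

```python
import os
from typing import Callable, Optional, Set

# Threshold quoted in the updated documentation: 0.1 kernel CPU seconds.
TOXIC_THRESHOLD_SECONDS = 0.1


def kernel_cpu_seconds() -> float:
    """Kernel (system) CPU time used so far by this process.

    Assumption: process-wide system time is a good enough stand-in for
    whatever per-operation measurement bees really performs.
    """
    return os.times().system


class ToxicExtentTracker:
    """Hypothetical helper that blacklists extents whose lookups are slow."""

    def __init__(
        self,
        clock: Callable[[], float] = kernel_cpu_seconds,
        threshold: float = TOXIC_THRESHOLD_SECONDS,
    ) -> None:
        self._clock = clock
        self._threshold = threshold
        self.toxic: Set[int] = set()  # extent addresses to avoid from now on

    def lookup(self, extent_addr: int, resolve: Callable[[int], object]) -> Optional[object]:
        """Run `resolve` (a stand-in for a LOGICAL_INO call) unless the extent is toxic."""
        if extent_addr in self.toxic:
            return None  # never touch a blacklisted extent again
        start = self._clock()
        result = resolve(extent_addr)
        elapsed = self._clock() - start
        if elapsed > self._threshold:
            # The lookup itself completed, but future operations skip this extent.
            self.toxic.add(extent_addr)
        return result
```

Measuring kernel CPU time instead of wall-clock time means that time the process spends stopped or waiting on slow disks is not counted against the extent, which is consistent with the removal of the `SIGSTOP` and IO-latency caveat in this commit.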