mirror of https://github.com/Zygo/bees.git (synced 2025-05-17 21:35:45 +02:00)
docs: toxic extents and btrfs send
Update documentation of toxic extent / slow backref workaround.

Add notes about btrfs send kernel bugs and incremental send failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
parent 688d0dc014
commit 19859b0a0d
@@ -38,19 +38,23 @@ Unfixed kernel bugs (as of 4.14.71):
 `rsync` is copying from, while `rsync` will rename the new file over
 the old file to replace it.
 
+* **btrfs send** has various problems when bees is deduping RO snapshots,
+especially if the snapshot is used as a parent for incremental send.
+
 Minor kernel problems with workarounds:
 
-* **Slow backrefs** (aka toxic extents): If the number of references to a
-single shared extent within a single file grows above a few thousand,
-the kernel consumes CPU for minutes at a time while holding various
-locks that block access to the filesystem. bees avoids this bug
-by measuring the time the kernel spends performing `LOGICAL_INO`
-operations and permanently blacklisting any extent or hash involved
-where the kernel starts to get slow. Inside bees, such blocks are
-known as 'toxic' hash/block addresses.
+* **Slow backrefs** (aka toxic extents): Under certain conditions,
+if the number of references to a single shared extent grows too high,
+the kernel consumes more and more CPU while holding locks that block
+access to the filesystem. bees avoids this bug by measuring the time
+the kernel spends performing `LOGICAL_INO` operations and permanently
+blacklisting any extent or hash involved where the kernel starts
+to get slow. In the bees log, such blocks are labelled as 'toxic'
+hash/block addresses.
 
-* **`FILE_EXTENT_SAME` is arbitrarily limited to 16MB**. This is
-less than 128MB which is the maximum extent size that can be created
-by defrag, prealloc, or filesystems without the `compress-force`
-mount option. bees avoids feedback loops this can generate while
-attempting to replace extents over 16MB in length.
+Older kernels:
+
+* Older kernels have various data corruption and deadlock/hang issues
+that are no longer listed here, and older kernels are missing important
+features such as `LOGICAL_INO_V2`. Using an older kernel is not
+recommended.
@@ -31,7 +31,8 @@ bees has been tested in combination with the following, and various problems are
 * btrfs send: some kernel versions have bugs in btrfs send that can be
 triggered by bees. The send can be restarted and will work if bees
 has finished processing the snapshot being sent. No data corruption
-observed other than the truncated send.
+observed other than the truncated send. Incremental send doesn't seem
+to work with bees running on the sending side.
 * btrfs qgroups: very slow, sometimes hangs...and it's even worse when
 bees is running.
 * btrfs autodefrag mount option: hangs and high CPU usage problems
@@ -78,22 +78,16 @@ Other Gotchas
 -------------
 
 * bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
-measuring the time required to perform `LOGICAL_INO` operations. If an
-extent requires over 10 seconds to perform a `LOGICAL_INO` then bees
-blacklists the extent and avoids referencing it in future operations.
-In most cases, fewer than 0.1% of extents in a filesystem must be
-avoided this way. This results in short write latency spikes of up
-to and a little over 10 seconds as btrfs will not allow writes to the
+measuring the time required to perform `LOGICAL_INO` operations.
+If an extent requires over 0.1 kernel CPU seconds to perform a
+`LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
+referencing it in future operations. In most cases, fewer than 0.1%
+of extents in a filesystem must be avoided this way. This results
+in short write latency spikes as btrfs will not allow writes to the
 filesystem while `LOGICAL_INO` is running. Generally the CPU spends
 most of the runtime of the `LOGICAL_INO` ioctl running the kernel,
-so on a single-core CPU the entire system can freeze up for a few
-seconds at a time.
+so on a single-core CPU the entire system can freeze up for a second
+during operations on toxic extents.
 
-* Load managers that send a `SIGSTOP` to the bees process to throttle
-CPU usage may affect the `LOGICAL_INO` timing mechanism, causing extents
-to be incorrectly labelled 'toxic'. This will cause a small reduction
-of dedupe hit rate. Slow and heavily loaded disks can trigger the same
-effect if `LOGICAL_INO` takes too long due to IO latency.
-
 * If a process holds a directory FD open, the subvol containing the
 directory cannot be deleted (`btrfs sub del` will start the deletion
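
Note on the timing workaround: the updated text says bees blacklists an extent when a `LOGICAL_INO` ioctl on it uses more than 0.1 kernel CPU seconds. The following is only a minimal Python sketch of that idea and is not taken from the bees source (bees itself is written in C++). The names `ToxicExtentTracker`, `lookup`, `resolve` and `sys_time` are hypothetical, and `resolve` stands in for a `LOGICAL_INO` wrapper that is not provided here. The sketch reads the process's system (kernel) CPU time via `os.times()`, a counter that does not advance while the process is stopped or waiting on IO.

```python
import os

# Threshold taken from the updated documentation text; the constant name is hypothetical.
TOXIC_THRESHOLD_SECONDS = 0.1


class ToxicExtentTracker:
    """Remembers extents whose lookup consumed too much kernel CPU time."""

    def __init__(self, threshold=TOXIC_THRESHOLD_SECONDS, sys_time=None):
        self.threshold = threshold
        # sys_time() returns the kernel CPU seconds used so far by this process.
        # It is injectable so the logic can be exercised without a real ioctl.
        self._sys_time = sys_time or (lambda: os.times().system)
        self.toxic = set()

    def lookup(self, extent_addr, resolve):
        """Call resolve(extent_addr) unless the extent is already blacklisted.

        resolve is an assumed stand-in for a LOGICAL_INO ioctl wrapper.
        Returns resolve's result, or None if the extent was marked toxic earlier.
        """
        if extent_addr in self.toxic:
            return None
        before = self._sys_time()
        result = resolve(extent_addr)
        elapsed = self._sys_time() - before
        if elapsed > self.threshold:
            # Too slow: avoid this extent in all future operations.
            self.toxic.add(extent_addr)
        return result
```

In this sketch the first slow lookup still returns its result and only later lookups are skipped. Whether bees discards the result of the slow call is not stated in the text above, so that choice is an assumption.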