This should help clean up some of the uglier status outputs.
Supports:
* multi-line table cells
* character fills
* sparse tables
* insert, delete by row and column
* vertical separators
and not much else.
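For illustration, usage might look something like this (a hypothetical sketch; the type and method names are illustrative, not the actual API):

```cpp
// Hypothetical usage sketch; names are illustrative, not the real API.
Table status;
status.insert_row(0, { "thread", "progress" });       // header row
status.insert_row(1, { "crawl_1", "42%\n(12 GiB)" }); // multi-line cell
status.insert_col(2, { "-", "-" });                   // fill a sparse column
status.mid(" | ");                                    // vertical separator
cout << status << endl;
```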
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We can no longer reliably determine the number of hash table matches,
since we'll stop counting after the first one.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We were doing a `LOGICAL_INO` ioctl on every _block_ of a matching extent,
just to see how long it takes. It takes a while!
This could be modified to do an ioctl with the `IGNORE_OFFSET` flag,
once per new extent, but the kernel bug was fixed a long time ago, so
we can start removing all the toxic extent code.
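For reference, the once-per-extent variant would look something like this (a sketch, assuming kernel headers new enough to provide the `LOGICAL_INO_V2` definitions; error handling omitted):

```cpp
#include <linux/btrfs.h>
#include <sys/ioctl.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: one LOGICAL_INO_V2 call per extent, with IGNORE_OFFSET so a
// single call covers every offset within the extent.
int logical_ino_once(int root_fd, uint64_t extent_bytenr)
{
	std::vector<uint8_t> buf(65536); // buffer size is illustrative
	btrfs_logical_ino_args args;
	memset(&args, 0, sizeof(args));
	args.logical = extent_bytenr;
	args.size = buf.size();
	args.flags = BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
	args.inodes = reinterpret_cast<uintptr_t>(buf.data());
	return ioctl(root_fd, BTRFS_IOC_LOGICAL_INO_V2, &args);
}
```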
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When we have multiple possible matches for a block, we proceed in three
phases:
1. retrieve each match's extent refs and put them in a list,
2. iterate over the list converting viable block matches into range matches,
3. sort and flatten the list of range matches into a non-overlapping
list of ranges that cover all duplicate blocks exactly once.
The separation of phases 1 and 2 creates a performance issue when there
are many block matches in phase 1 and all the range matches in phase
2 are the same length. Even though we might quickly find the longest
possible matching range early in phase 2, phase 1 has already extracted
the extent refs from every possible matching block, and most of those
refs will never be used.
Fix this by moving the extent ref retrieval from phase 1 into a single
loop in phase 2, and by stopping the loop over matching blocks as soon
as any dedupe range is created. This avoids iterating over a large list
of blocks with expensive `LOGICAL_INO` ioctls in an attempt to improve
the match when there is no hope of improvement, e.g. when all match
ranges are 4K and the content is extremely prevalent in the data.
If we find a matched block that is part of a short matching range,
we can replace it with a block that is part of a long matching range,
because there is a good chance we will find a matching hash block in
the long range by looking up hashes after the end of the short range.
In that case, overlapping dedupe ranges covering both blocks in the
target extent will be inserted into the dedupe list, and the longest
matches will be selected in phase 3. This usually produces a result
similar to that of the old phase 1 loop, but _much_ more efficiently.
Some operations are left in phase 1, but they all use internal
functions rather than ioctls.
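In outline, the new phase 2 loop looks something like this (a heavily simplified sketch; the type and helper names are illustrative):

```cpp
#include <vector>

// Heavily simplified sketch; type and helper names are illustrative.
struct BlockMatch;  // one hash-table hit
struct ExtentRef;   // one extent reference from LOGICAL_INO
struct RangeMatch;  // a grown match, input to phase 3

std::vector<ExtentRef> resolve_extent_refs(const BlockMatch &bm); // the expensive ioctl
void grow_range_matches(const std::vector<ExtentRef> &refs,
                        const BlockMatch &bm,
                        std::vector<RangeMatch> &out);

std::vector<RangeMatch> find_range_matches(const std::vector<BlockMatch> &block_matches)
{
	std::vector<RangeMatch> range_matches;
	for (const auto &bm : block_matches) {
		// Former phase 1 work, now done lazily per block:
		const auto refs = resolve_extent_refs(bm);
		grow_range_matches(refs, bm, range_matches);
		// Stop as soon as any dedupe range exists; iterating
		// further would burn LOGICAL_INO calls with no hope of
		// a better match.
		if (!range_matches.empty()) break;
	}
	return range_matches; // phase 3 sorts and flattens these
}
```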
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A laundry list of problems fixed:
* Track which physical blocks have been read recently without making
any changes, and don't read them again.
* Separate dedupe, split, and hole-punching operations into distinct
planning and execution phases.
* Keep only the longest dedupe among overlapping dedupe matches, and
flatten them into non-overlapping operations.
* Don't scan extents that have blocks already in the hash table.
We can't (yet) touch such an extent without making unreachable space.
Let them go.
* Give better information in the scan summary visualization: show dedupe
range start and end points (<ddd>), matching blocks (=), copy blocks
(+), zero blocks (0), inserted blocks (.), unresolved match blocks
(M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i),
and there's-a-bug-we-didn't-do-this-one blocks (#).
* Drop cached data from extents that have been inserted into the hash
table without modification.
* Rewrite the hole punching for uncompressed extents, which apparently
hasn't worked properly since the beginning.
Nuisance dedupe elimination:
* Don't do more than 100 dedupe, copy, or hole-punch operations per
extent ref.
* Don't split an extent or punch a hole unless dedupe would save at
least half of the extent ref's size.
* Write a "skip:" summary showing the planned work when nuisance
dedupe elimination decides to skip an extent.
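The two nuisance limits above reduce to checks like these (a sketch; the names and exact accounting are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the two nuisance-dedupe limits; names are illustrative.
constexpr size_t max_ops_per_ref = 100;

// Limit 1: no more than 100 dedupe, copy, or hole-punch operations
// per extent ref.
bool too_many_ops(size_t planned_ops)
{
	return planned_ops > max_ops_per_ref;
}

// Limit 2: only split an extent or punch a hole when dedupe would
// save at least half of the extent ref's size.
bool split_worthwhile(uint64_t bytes_saved_by_dedupe, uint64_t ref_bytes)
{
	return bytes_saved_by_dedupe * 2 >= ref_bytes;
}
```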
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a master switch to turn off the entire MultiLock infrastructure for
testing, without having to remove and add all the individual entry points.
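The switch amounts to one static flag consulted by every entry point (a sketch modeled loosely on MultiLocker; details may differ):

```cpp
// Sketch of a master switch; assumes every entry point consults one
// static flag instead of being individually removed for testing.
class MultiLocker {
	static bool s_enabled; // defaults to true
public:
	static void enabled(bool e) { s_enabled = e; }
	// Lock-acquiring entry points return immediately when s_enabled
	// is false, so all call sites can stay in place.
};
```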
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This prevents the storms of exceptions that occur when a subvol is
deleted. We simply treat the entire tree as if it were empty.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
try_lock allows specification of a different Task to be run instead of
the current Task when the lock is busy.
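The shape of the new entry point is roughly (a sketch modeled loosely on crucible's Exclusion; the real signature may differ):

```cpp
// Sketch; the exact crucible signature may differ.
class Exclusion {
public:
	// If the lock is free, returns a held lock object.
	// If the lock is busy, queues task_when_busy to run when the
	// current holder releases, and returns an empty lock object so
	// the caller can back out instead of blocking.
	ExclusionLock try_lock(const Task &task_when_busy);
};
```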
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Originally the limit was 2730 (64KiB worth of ref pointers). This limit
was a little too low for some common workloads, so it was then raised by
a factor of 256 to 699050, but there are a lot of problems with extent
counts that large. Most of those problems are memory usage and speed
problems, but some of them trigger subtle kernel MM issues.
699050 references is too many to be practical. Set the limit to 9999,
only 3-4x larger than the original 2730, to give up on deduplication
when each deduped ref reduces the amount of space by no more than 0.01%.
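For reference, the arithmetic behind those limits (24 bytes per ref, i.e. three u64 values, is implied by the 64KiB/2730 figure above):

```cpp
#include <cstddef>

// 24 bytes per extent ref (three u64 values):
//   64 KiB / 24 bytes =   2730 refs  (the original limit)
//   16 MiB / 24 bytes = 699050 refs  (the raised limit)
// At 9999 refs, one additional deduped ref reduces space usage by at
// most 1/9999, roughly 0.01% of the extent's size.
static const size_t BEES_MAX_EXTENT_REF_COUNT = 9999;
```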
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves a third bad problem with bees reads:
3. The architecture above the read operations will issue read requests
for the same physical blocks over and over in a short period of time.
Fixing that properly requires rewriting the upper-level code, but a
simple small table of recent read requests can reduce the effect of the
problem by orders of magnitude.
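A minimal sketch of such a table, assuming a simple time-stamped map (the window length and all names here are illustrative, and a real implementation would also cap the table's size):

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <mutex>

// Sketch: remember recently read physical addresses and suppress repeats.
class RecentReads {
	using clock = std::chrono::steady_clock;
	std::mutex                            m_mutex;
	std::map<uint64_t, clock::time_point> m_seen; // physical addr -> last read
public:
	// Returns true if this physical address was read within the window,
	// in which case the caller skips the read entirely.
	bool seen_recently(uint64_t physical,
	                   std::chrono::seconds window = std::chrono::seconds(60))
	{
		std::unique_lock<std::mutex> lock(m_mutex);
		const auto now = clock::now();
		const auto it = m_seen.find(physical);
		if (it != m_seen.end() && now - it->second < window) return true;
		m_seen[physical] = now;
		return false;
	}
};
```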
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves some of the worst problems with bees reads:
1. The kernel readahead doesn't work. More precisely, it's much better
adapted for a very different use case: a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead. The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads get
preempted by the random reads, or the kernel cancels them because the
data access pattern isn't sequential after the readahead was issued.
2. Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces. At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile. In all other cases,
the elaborate calculations always return the same result: there can be
only one reader at a time.
This commit fixes both problems:
1. Don't use the kernel readahead. Use normal reads into a dummy
buffer instead.
2. Allow only one thread to readahead at any time. Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.
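The replacement readahead is roughly (a simplified sketch of the approach, not the exact bees code; the buffer size is illustrative):

```cpp
#include <algorithm>
#include <mutex>
#include <vector>
#include <unistd.h>

// Simplified sketch: one readahead at a time, implemented as plain
// reads into a throwaway buffer so the data lands in the page cache.
static std::mutex s_readahead_mutex;

void bees_readahead(int fd, off_t offset, size_t size)
{
	std::unique_lock<std::mutex> lock(s_readahead_mutex);
	std::vector<char> dummy(128 * 1024); // buffer size is illustrative
	while (size > 0) {
		const size_t chunk = std::min(dummy.size(), size);
		const ssize_t rv = pread(fd, dummy.data(), chunk, offset);
		if (rv <= 0) break; // EOF or error: nothing more to warm up
		offset += rv;
		size -= size_t(rv);
	}
}

// bees_readahead_pair does the same for two nearby ranges while holding
// the lock once across both reads (this signature is a sketch).
void bees_readahead_pair(int fd, off_t offset, size_t size,
                         int fd2, off_t offset2, size_t size2);
```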
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table is read sequentially and from a single thread, so
the kernel's implementation of readahead is appropriate here.
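Expressed as code, that choice is just a hint to the kernel (a sketch; `posix_fadvise` is shown here, though the exact call used may differ):

```cpp
#include <fcntl.h>

// Sketch: for a single-threaded sequential reader, the kernel's own
// readahead logic works well; a whole-file sequential hint is enough.
void hint_sequential(int fd)
{
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```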
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit c3b664fea54cfd8ac25411cbdb9536e4f24b008e ("context: don't forget
to retry locked extents") removed the critical return that prevents a
Task from processing an extent that is locked.
Put the return back.
Fixes: c3b664fea54cfd8ac25411cbdb9536e4f24b008e ("context: don't forget to retry locked extents")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The kernel has not required a 16 MiB limit on dedupe requests since
v4.18-rc1 b67287682688 ("Btrfs: dedupe_file_range ioctl: remove 16MiB
restriction").
Kernels before v4.18 would truncate the request and return the size
actually deduped in `bytes_deduped`. Kernel v4.18 and later will loop
in the kernel until the entire request is satisfied (although still
in 16 MiB chunks, so larger extents will be split).
Modify the loop in userspace to measure the size the kernel actually
deduped, instead of assuming the kernel will only accept 16 MiB.
On current kernels this will always loop exactly once.
Since we now rely on `bytes_deduped`, make sure it has a sane value.
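The loop then becomes roughly the following (a sketch using the `FIDEDUPERANGE` interface from `linux/fs.h`; error handling trimmed):

```cpp
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: dedupe [src_off, src_off+len) against dst, advancing by the
// size the kernel reports in bytes_deduped instead of assuming 16 MiB.
int dedupe_range(int src_fd, uint64_t src_off, uint64_t len,
                 int dst_fd, uint64_t dst_off)
{
	std::vector<uint8_t> buf(sizeof(file_dedupe_range) +
	                         sizeof(file_dedupe_range_info));
	auto *fdr = reinterpret_cast<file_dedupe_range *>(buf.data());
	while (len > 0) {
		memset(buf.data(), 0, buf.size());
		fdr->src_offset = src_off;
		fdr->src_length = len;
		fdr->dest_count = 1;
		fdr->info[0].dest_fd = dst_fd;
		fdr->info[0].dest_offset = dst_off;
		if (ioctl(src_fd, FIDEDUPERANGE, fdr) < 0) return -1;
		if (fdr->info[0].status != FILE_DEDUPE_RANGE_SAME) return -1;
		const uint64_t done = fdr->info[0].bytes_deduped;
		if (done == 0) break; // sanity check: avoid an infinite loop
		src_off += done;
		dst_off += done;
		len -= done;
	}
	return 0;
}
```

On kernels before v4.18 this loops once per truncated chunk; on current kernels it loops exactly once.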
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These were added to crucible all the way back in 2018 (1beb61fb78ba
"crucible: error: record location of exception in what() message"),
but they're even more useful in the stack tracer in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent
references by definition, it follows that we should not allocate buffer
space for more than that when we perform the LOGICAL_INO ioctl.
There is some evidence (particularly
https://github.com/Zygo/bees/issues/260#issuecomment-1627598058) that
the kernel is subjecting the page cache to a lot of disruption when
trying to allocate large buffers for LOGICAL_INO.
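The buffer math is straightforward (a sketch; the 24 bytes per ref, three u64 values, follows from the 64KiB/2730 figure in the earlier limit commit):

```cpp
#include <linux/btrfs.h>
#include <cstddef>
#include <cstdint>

// Sketch: size the LOGICAL_INO buffer for exactly the refs we can use,
// rather than a large fixed allocation.
const size_t BEES_MAX_EXTENT_REF_COUNT = 9999; // from the earlier commit
const size_t logical_ino_buf_size =
	sizeof(btrfs_data_container) +
	BEES_MAX_EXTENT_REF_COUNT * 3 * sizeof(uint64_t);
```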
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apparently reinterpret_cast<uint64_t> sign-extends 32-bit pointers.
This is OK when running on a 32-bit kernel that will truncate the pointer
to 32 bits, but when running on a 64-bit kernel, the extra bits are
interpreted as part of the (now very invalid) address.
Use <uintptr_t> instead, which is an unsigned integer type of the same
word size as the arch's pointer type. Ordinary numeric conversion can
take it from there, filling the rest of the word with zeros.
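The fix in miniature:

```cpp
#include <cstdint>

uint64_t ptr_to_u64(const void *p)
{
	// reinterpret_cast<uint64_t>(p) may sign-extend a 32-bit pointer;
	// uintptr_t is unsigned and pointer-sized, so the subsequent
	// widening to 64 bits zero-fills the upper half of the word.
	return reinterpret_cast<uintptr_t>(p);
}
```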
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The bug is:
v6.3-rc6: f349b15e183d mm: vmalloc: avoid warn_alloc noise caused by fatal signal
The fixes are:
v6.4: 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
v6.3.10: c189994b5dd3 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
The bug has been backported to LTS, but the fix has not:
v6.2.11: 61334bc29781 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
v6.1.24: ef6bd8f64ce0 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
v5.15.107: a184df0de132 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There was a bug in kernel 6.3 where LOGICAL_INO with IGNORE_OFFSET
sometimes failed to ignore the offset. That bug is now fixed, but
LOGICAL_INO still returns 0 refs much more often than seems appropriate.
This is most likely because bees frequently deletes extents while there
is still work waiting for them in Task queues. In this case, LOGICAL_INO
correctly returns an empty list, because every reference to the extent
has been deleted, but the new extent tree with that extent removed is
not yet committed in btrfs.
Add a DEBUG-level log message and an event counter to track these events.
In the absence of a kernel bug, the debug message may indicate CPU time
was wasted performing a search whose outcome could have been predicted.
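The instrumentation is roughly the following (a sketch using bees' existing log and counter macros; the counter name and surrounding variables are illustrative):

```cpp
// Sketch; the counter name "resolve_empty" is illustrative.
if (refs.empty()) {
	BEESLOGDEBUG("LOGICAL_INO returned 0 refs at " << to_hex(addr));
	BEESCOUNT(resolve_empty);
}
```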
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extents are much less of a problem now than they were in kernels
before 5.7. Downgrade the log message level to reflect their lesser
importance.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The critical kernel bugs in send have been fixed for years.
The limitations that remain aren't bugs, and bees has no sustainable
workaround for them.
Also update copyright year range.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We check the result of transid_max_nocache(), but not the result of
transid_max(). The latter is a computed result that is even more likely
to be wrong[citation needed].
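A sketch of the added check, using crucible's THROW_CHECK macro family (the exact predicate in bees may differ):

```cpp
// Sketch: sanity-check the computed value the same way as the
// non-cached one.
const auto transid = transid_max();
THROW_CHECK1(runtime_error, transid, transid > 0);
```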
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
At least one user was significantly confused by "designed for large
filesystems".
The btrfs send workarounds aren't new any more.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Clarify that "too large" and "too small" are some distance away from each other.
The Goldilocks zone is _wide_.
The interval between cache drops is now shorter.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>