BeesScanModeExtent uses six scan Tasks instead of one, which leads
to awkwardness like the do_scan method having to tell crawl_roots how
to do something that crawl_roots shouldn't need to know how to do at all.
Move the crawl_roots logic into the ::scan methods themselves.
This also deletes the very popular "crawl_more ran out of data" message.
Extent scan explicitly indicates when a scan is complete, so there's
no longer a need to fish this message out of the log.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The sorting avoids read orders that btrfs is not optimized for, such
as extent refs in the same inode with descending offsets.
Putting everything in one Task keeps the queue sizes small, and
manages the lock contention much more calmly.
We only want to map extent refs if there aren't enough extents
already in the queue to keep worker threads busy, so use the `idle()`
method instead of `run()`.
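One way to get the read order described above is to sort refs by
(subvol, inode, offset); a minimal sketch with illustrative names, not
bees' actual types:

```cpp
// Illustrative only: sort extent refs so reads within one inode always
// proceed by ascending offset, an order btrfs handles well.
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct ExtentRef {
    uint64_t root;    // subvol id
    uint64_t inode;
    uint64_t offset;
};

void sort_refs_for_reading(std::vector<ExtentRef> &refs)
{
    std::sort(refs.begin(), refs.end(),
        [](const ExtentRef &a, const ExtentRef &b) {
            return std::tie(a.root, a.inode, a.offset)
                 < std::tie(b.root, b.inode, b.offset);
        });
}
```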
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The EXTENT scan mode reads the extent tree, splits it into tiers by
extent size, converts each tier's extents into subvol/inode/offset refs,
then runs the legacy bees dedupe engine on the refs.
The extent scan mode can cheaply compute completion percentage and ETA,
so do that every time a new transid is observed.
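For illustration, the tier split could look like the sketch below; the
boundaries shown are hypothetical, not the ones bees uses:

```cpp
// Hypothetical tier boundaries: larger extents dedupe more space per
// ref, so each tier gets its own crawler and large tiers can be
// scheduled ahead of tiers full of tiny extents.
#include <cstddef>
#include <cstdint>

static size_t extent_size_tier(uint64_t len)
{
    if (len >= 32 * 1024 * 1024) return 0;
    if (len >= 4 * 1024 * 1024)  return 1;
    if (len >= 512 * 1024)       return 2;
    if (len >= 128 * 1024)       return 3;
    return 4;  // everything down to one block
}
```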
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We were doing a `LOGICAL_INO` ioctl on every _block_ of a matching extent,
just to see how long it takes. It takes a while!
This could be modified to do an ioctl with the `IGNORE_OFFSET` flag,
once per new extent, but the kernel bug was fixed a long time ago, so
we can start removing all the toxic extent code.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When we have multiple possible matches for a block, we proceed in three
phases:
1. retrieve each match's extent refs and put them in a list,
2. iterate over the list converting viable block matches into range matches,
3. sort and flatten the list of range matches into a non-overlapping
list of ranges that cover all duplicate blocks exactly once.
The separation of phase 1 and 2 creates a performance issue when there
are many block matches in phase 1, and all the range matches in phase
2 are the same length. We might quickly find the longest possible
matching range early in phase 2, but phase 1 has already extracted
the extent refs from every possible matching block, and most of those
refs will never be used.
Fix this by moving the phase 1 extent ref retrieval into a single loop
in phase 2, and stopping the loop over matching blocks as soon as any
dedupe range is created. This avoids iterating over a large list of
blocks with expensive `LOGICAL_INO` ioctls in an attempt to improve the
match when there is no hope of improvement, e.g. when all match ranges
are 4K and the content is extremely prevalent in the data.
If we find a matched block that is part of a short matching range,
we can replace it with a block that is part of a long matching range,
because there is a good chance we will find a matching hash block in
the long range by looking up hashes after the end of the short range.
In that case, overlapping dedupe ranges covering both blocks in the
target extent will be inserted into the dedupe list, and the longest
matches will be selected in phase 3. This usually produces a result
similar to that of the old phase 1 loop, but _much_ more efficiently.
Some operations are left in phase 1, but they are all using internal
functions, not ioctls.
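Sketched with illustrative names (the real bees types and helpers
differ), the restructured loop looks like this:

```cpp
// Illustrative sketch: ref resolution, formerly a separate phase 1
// pass, now runs inside the phase 2 loop, and the loop stops as soon
// as any dedupe range is created.
#include <cstdint>
#include <optional>
#include <vector>

struct Block { uint64_t physical; };
struct Ref   { uint64_t root, inode, offset; };
struct Range { uint64_t start, length; };

std::vector<Ref> resolve_extent_refs(const Block &);  // LOGICAL_INO
std::optional<Range> try_extend_match(const Ref &, const Block &);

std::vector<Range> find_dedupe_ranges(const std::vector<Block> &matches)
{
    std::vector<Range> dedupe_list;
    for (const auto &block : matches) {
        // The expensive ioctl now runs only for blocks we actually reach.
        for (const auto &ref : resolve_extent_refs(block)) {
            if (auto range = try_extend_match(ref, block)) {
                dedupe_list.push_back(*range);
            }
        }
        // Stop as soon as we have any range; phase 3 sorts and flattens.
        if (!dedupe_list.empty()) break;
    }
    return dedupe_list;
}
```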
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A laundry list of problems fixed:
* Track which physical blocks have been read recently without making
any changes, and don't read them again.
* Separate dedupe, split, and hole-punching operations into distinct
planning and execution phases.
* Keep the longest dedupe from overlapping dedupe matches, and flatten
them into non-overlapping operations.
* Don't scan extents that have blocks already in the hash table.
We can't (yet) touch such an extent without creating unreachable space.
Let them go.
* Give better information in the scan summary visualization: show dedupe
range start and end points (<ddd>), matching blocks (=), copy blocks
(+), zero blocks (0), inserted blocks (.), unresolved match blocks
(M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i),
and there's-a-bug-we-didn't-do-this-one blocks (#).
* Drop cached data from extents that have been inserted into the hash
table without modification.
* Rewrite the hole punching for uncompressed extents, which apparently
hasn't worked properly since the beginning.
Nuisance dedupe elimination:
* Don't do more than 100 dedupe, copy, or hole-punch operations per
extent ref.
* Don't split an extent or punch a hole unless dedupe would save at
least half of the extent ref's size.
* Write a "skip:" summary showing the planned work when nuisance
dedupe elimination decides to skip an extent.
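The two nuisance thresholds reduce to a simple test; the constants
below come from the list above, but the names do not come from bees:

```cpp
// Illustrative names: decide whether rewriting an extent ref is worth
// the operations it would take.
#include <cstddef>

bool worth_rewriting_extent(size_t ref_bytes, size_t bytes_saved,
                            size_t op_count)
{
    // No more than 100 dedupe/copy/hole-punch operations per extent ref.
    if (op_count > 100) return false;
    // Splitting or hole-punching must save at least half the ref's size.
    if (bytes_saved * 2 < ref_bytes) return false;
    return true;
}
```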
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Originally the limit was 2730 (64KiB worth of ref pointers). This limit
was a little too low for some common workloads, so it was then raised by
a factor of 256 to 699050, but there are a lot of problems with extent
counts that large. Most of those problems are memory usage and speed
problems, but some of them trigger subtle kernel MM issues.
699050 references is too many to be practical. Set the limit to 9999,
only 3-4x larger than the original 2730, to give up on deduplication
when each deduped ref reduces the amount of space used by no more than
0.01% (1/9999).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves a third bad problem with bees reads:
3. The architecture above the read operations will issue read requests
for the same physical blocks over and over in a short period of time.
Fixing that properly requires rewriting the upper-level code, but a
simple small table of recent read requests can reduce the effect of the
problem by orders of magnitude.
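Such a table can be as simple as a small LRU keyed by (physical
address, length); a minimal illustrative sketch, not the exact bees
implementation:

```cpp
// Illustrative small LRU of recent read requests.
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <utility>

class RecentReads {
    using Key = std::pair<uint64_t, uint64_t>;  // (physical, length)
    std::list<Key> m_order;                     // front = most recent
    std::map<Key, std::list<Key>::iterator> m_index;
    size_t m_limit = 1024;
public:
    // Returns true if this exact read was issued recently; records it
    // either way, evicting the oldest entry when the table is full.
    bool seen_recently(uint64_t physical, uint64_t length)
    {
        Key k(physical, length);
        auto it = m_index.find(k);
        if (it != m_index.end()) {
            m_order.splice(m_order.begin(), m_order, it->second);
            return true;
        }
        m_order.push_front(k);
        m_index[k] = m_order.begin();
        if (m_order.size() > m_limit) {
            m_index.erase(m_order.back());
            m_order.pop_back();
        }
        return false;
    }
};
```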
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves some of the worst problems with bees reads:
1. The kernel readahead doesn't work. More precisely, it's much better
adapted for a very different use case: a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead. The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads get
preempted by the random reads, or the kernel cancels the readaheads
because the data access pattern isn't sequential after the readahead
was issued.
2. Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces. At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile. In all other cases,
the elaborate calculations always return the same result: there can be
only one reader at a time.
This commit fixes both problems:
1. Don't use the kernel readahead. Use normal reads into a dummy
buffer instead.
2. Allow only one thread to readahead at any time. Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.
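A minimal sketch of both fixes, assuming plain POSIX pread; the real
bees_readahead does more bookkeeping and error handling than this:

```cpp
// Illustrative sketch of the two fixes described above.
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <vector>
#include <unistd.h>

static std::mutex s_readahead_mutex;  // fix 2: one readahead at a time

void readahead_sketch(int fd, off_t offset, size_t size)
{
    std::unique_lock<std::mutex> lock(s_readahead_mutex);
    // Fix 1: plain reads into a throwaway buffer populate the page
    // cache at normal ioprio, unlike readahead(2)'s idle-priority IO.
    std::vector<uint8_t> dummy(128 * 1024);
    while (size > 0) {
        const size_t chunk = std::min(size, dummy.size());
        const ssize_t rv = pread(fd, dummy.data(), chunk, offset);
        if (rv <= 0) break;           // EOF or error: stop prefetching
        offset += rv;
        size -= size_t(rv);
    }
    // A bees_readahead_pair variant would hold this same lock across
    // two such loops.
}
```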
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table is read sequentially and from a single thread, so
the kernel's implementation of readahead is appropriate here.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit c3b664fea54cfd8ac25411cbdb9536e4f24b008e ("context: don't forget
to retry locked extents") removed the critical return that prevents a
Task from processing an extent that is locked.
Put the return back.
Fixes: c3b664fea54cfd8ac25411cbdb9536e4f24b008e ("context: don't forget to retry locked extents")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These were added to crucible all the way back in 2018 (1beb61fb78ba
"crucible: error: record location of exception in what() message")
but they're even more useful in the stack tracer in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since by definition we'll never process more than BEES_MAX_EXTENT_REF_COUNT
extent references, there's no reason to allocate buffer space for more
than that when we perform the LOGICAL_INO ioctl.
There is some evidence (particularly
https://github.com/Zygo/bees/issues/260#issuecomment-1627598058) that
the kernel is subjecting the page cache to a lot of disruption when
trying to allocate large buffers for LOGICAL_INO.
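The buffer sizing follows directly from the ref limit and the btrfs
UAPI layout, where each ref costs three u64 values in the data
container (which is also where the original limit came from:
64KiB / 24 bytes ≈ 2730). A sketch:

```cpp
// Size the LOGICAL_INO result buffer from the ref limit instead of a
// fixed large allocation. Uses the btrfs UAPI struct layout.
#include <linux/btrfs.h>
#include <cstddef>
#include <cstdint>
#include <vector>

static const size_t BEES_MAX_EXTENT_REF_COUNT = 9999;

std::vector<uint8_t> make_logical_ino_buffer()
{
    // Each ref is an (inode, offset, root) triple of u64 values.
    const size_t bytes = sizeof(struct btrfs_data_container)
        + BEES_MAX_EXTENT_REF_COUNT * 3 * sizeof(uint64_t);
    return std::vector<uint8_t>(bytes);
}
```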
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There was a bug in kernel 6.3 where LOGICAL_INO with IGNORE_OFFSET
sometimes fails to ignore the offset. That bug is now fixed, but
LOGICAL_INO still returns 0 refs much more often than seems appropriate.
This is most likely because bees frequently deletes extents while there
is still work waiting for them in Task queues. In this case, LOGICAL_INO
correctly returns an empty list, because every reference to the extent
has been deleted, but the new extent tree with that extent removed is
not yet committed in btrfs.
Add a DEBUG-level log message and an event counter to track these events.
In the absence of a kernel bug, the debug message may indicate CPU time
was wasted performing a search whose outcome could have been predicted.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extents are much less of a problem now than they were in kernels
before 5.7. Downgrade the log message level to reflect their lesser
importance.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We check the result of transid_max_nocache(), but not the result of
transid_max(). The latter is a computed result that is even more likely
to be wrong [citation needed].
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Each object contains a 16 MiB buffer, which is very heavy for some
malloc implementations.
Keep the objects in a Pool so that their buffers are only allocated and
deallocated once in the process lifetime.
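The pattern, in a generic sketch (crucible's Pool template differs in
detail): buffers are recycled through a custom shared_ptr deleter
instead of being freed.

```cpp
// Generic pooling sketch. The deleter returns the object (and its
// 16 MiB buffer) to the free list instead of destroying it. The pool
// must outlive every object it hands out.
#include <memory>
#include <mutex>
#include <vector>

template <class T>
class SimplePool {
    std::mutex m_mutex;
    std::vector<std::unique_ptr<T>> m_free;
public:
    std::shared_ptr<T> get()
    {
        std::unique_ptr<T> obj;
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            if (!m_free.empty()) {
                obj = std::move(m_free.back());
                m_free.pop_back();
            }
        }
        if (!obj) obj = std::make_unique<T>();  // allocate only once
        return std::shared_ptr<T>(obj.release(), [this](T *p) {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_free.emplace_back(p);             // recycle, don't free
        });
    }
};
```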
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If the send workaround is enabled, it is possible for two threads (a
thread running the crawl_new task, and a thread attempting to apply the
send workaround) to access the same RootFetcher object at the same time.
That never ends well.
Give each function its own BtrfsRootFetcher object.
Fixes: https://github.com/Zygo/bees/issues/250
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With SIGTERM and fast exit, the trickle writeback is less important.
We don't want to flood people's IO subsystems with continuous writes.
This really should be configurable at runtime.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Do rebuild bees-version.cc if libcrucible changes.
Don't rebuild bees-version.cc if it doesn't change.
Also use the standard suffix for new files.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These tools are obsolete. fiemap was a thin wrapper around FIEMAP,
but FIEMAP is not useful on btrfs. fiewalk was a thin wrapper around
BtrfsExtentWalker, but development on BtrfsExtentWalker has been
abandoned.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When a hash table write fails, we skip over the write throttling because
we didn't report that we successfully wrote an extent. This can be bad
if the filesystem is full and the allocations for writes are burning a
lot of CPU time searching for free space.
We also don't retry the write later on since we assume the extent is
clean after a write attempt whether it was successful or not, so the
extent might not be written out later when writes are possible again.
Check whether a hash extent is dirty, and always throttle after
attempting the write.
If a write fails, leave the extent dirty so we attempt to write it out
the next time flush cycles through the hash table. During shutdown
this will reattempt each failing write once; after that, the updated
hash table data will be dropped.
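The corrected flush logic reduces to the sketch below, with
illustrative names rather than the actual BeesHashTable interface:

```cpp
// Illustrative flush logic: throttle after every attempt, and keep
// the extent dirty when the write fails.
struct HashExtent { bool dirty = false; /* 16 MiB of hash cells... */ };

bool write_hash_extent(HashExtent &);  // false on failure, e.g. ENOSPC
void throttle_writes();                // paces hash table writeback

void flush_hash_extent(HashExtent &ext)
{
    if (!ext.dirty) return;            // clean extents are never rewritten
    if (write_hash_extent(ext)) {
        ext.dirty = false;             // success: clean until modified again
    }
    // On failure the extent stays dirty, so the next flush cycle (or
    // the shutdown flush) retries instead of silently dropping it.
    throttle_writes();                 // throttle whether we succeeded or not
}
```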
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Calling 'bees -m4' should not call 'std::terminate()', but it does.
Use catch_all instead. It will still pass the exit value to return
from main.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESTOOLONG was always reporting a size of zero, and the offset of the
end of the readahead region. Report the original size instead (and also
in BEESTRACE and BEESNOTE).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Drop the crawl_restart counter; the event it counted doesn't happen here
(or anywhere else).
Add the crawl_again counter for extents that are restarted due to an
extent-level lock.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
libcrucible can deal with the Linux kernel and/or libc's thread name
limitations. No need to duplicate that work in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The caller of scan_forward has to stop advancing the BeesFileCrawl
position when an extent lock blocks a scan, so that it will resume
from the same position when the Task is scheduled again; otherwise,
bees simply skips over the extent and leaves it incompletely deduped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restart crawl_more (and update crawl roots and flush FD caches) every
time the transid changes, and only when the transid changes, but
not more often than a reasonable minimum poll interval.
Clean up the log message: use the proper thread name and remove
the wildly inaccurate estimate of when crawl will resume.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need to cache 65536 extent maps, especially if each one
can have almost 700K references.
Valgrind's massif tool points to the extent map cache as a very
large memory allocator, but test runs with memcg disagree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we have loadavg targeting enabled, there may be no worker threads
available to respond to new subvols, so we should not bother updating
the subvols list.
Put insert_new_crawl into a Task so it only executes when a worker
is available.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On large filesystems where the min_transid of all subvols gets stuck at 0,
bees may lose the ability to effectively track recent data. A secondary sort
by max_transid will allow scanning newer subvols that were created after bees
started running on the filesystem, but before bees completed the first scan
of all subvols.
On the other hand, the secondary sort does a reverse version of the
sequential scan mode, and the sequential scan mode is simply awful.
Disable it for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split each scan mode into two distinct phases:
1. A heavy discovery phase, where we search the entire filesystem
for something (new items in subvol trees in this case).
2. A light consuming phase, where we fetch extents to dedupe
from places that we found in the discovery phase.
Part 1 recomputes the subvol ordering every time there is a new transid.
For some scan modes this computation is quite expensive, far too costly
to pay for every extent, so we do it no more than once per transaction.
Part 2 is run every time a worker thread hits the crawl_more Task.
It simply pulls one extent from the first crawler off a sorted list,
removing the crawler from the list when the crawler runs out of data.
Part 1 creates a new structure and swaps it into place, while Part 2
continues to run using the previous structure. Neither of these needs
to block the other, so they don't.
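That handoff can be sketched as a mutex-guarded shared_ptr swap; the
names here are illustrative:

```cpp
// Illustrative sketch: Part 1 swaps in a freshly built list under a
// short lock; Part 2 keeps using the snapshot it already holds.
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

struct CrawlList { std::vector<uint64_t> sorted_crawlers; };

static std::shared_ptr<const CrawlList> s_crawl_list;
static std::mutex s_crawl_list_mutex;

void rebuild_crawl_list()              // Part 1: once per new transid
{
    auto fresh = std::make_shared<CrawlList>();
    // ...expensive recomputation of the subvol ordering here...
    std::unique_lock<std::mutex> lock(s_crawl_list_mutex);
    s_crawl_list = fresh;              // older snapshots stay valid
}

std::shared_ptr<const CrawlList> get_crawl_list()  // Part 2: per Task
{
    std::unique_lock<std::mutex> lock(s_crawl_list_mutex);
    return s_crawl_list;
}
```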
The separate class and base pointer also make it easier to add new scan
modes that are not based on subvol trees or that don't use BeesCrawl.
While we're here, fix up some method visibility in BeesRoots.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Set the constructor's default scan mode to an invalid mode, so if we
change the default, we don't have to update two places.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Crawl mode 3 'recent' prioritizes data from new updates to previously
scanned subvols over subvols that have not been completely scanned yet.
If no such new data exists, it falls back to a variation of the
'lockstep' scan mode.
This enables us to keep up with new data as it arrives, addressing a
key weakness of all the other scan modes, and is worth violating our
unwritten "no new scan modes until we have extent-tree dedupe working"
policy for.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Inode-oriented scan workers must do all of their work sequentially,
so it's counterproductive to spawn a Task to do a background dedupe.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When two Tasks attempt to lock the same extent, append the later Task
to the earlier Task's post-exec work queue. This will guarantee that
all Tasks which attempt to manipulate the same extent will execute
sequentially, and free up threads to process other extents.
Similarly, if two scanner threads operate on the same inode, any dedupe
they perform will lock out other scanner threads in btrfs. Avoid this
by serializing Task objects that reference the same file.
This does theoretically use an unbounded amount of memory, but in practice
a Task that encounters a contended extent or inode quickly stops spawning
new Tasks that might increase the queue size, and all Tasks that might
contend for the same lock(s) end up on a single FIFO queue.
Note that the scope of inode locks is intentionally global, i.e. when
an inode is locked, it locks every inode with the same number in every
subvol. This avoids significant lock contention and task queue growth
when the same inode with the same file extents appears in snapshots.
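The lock-or-defer idea, sketched with illustrative names (crucible's
actual Task and lock classes differ):

```cpp
// Illustrative sketch: locks are keyed by inode number only, and a
// FIFO of deferred Tasks runs when the lock holder releases the lock.
#include <cstdint>
#include <deque>
#include <functional>
#include <map>
#include <mutex>

using TaskFn = std::function<void()>;

class InodeLockTable {
    std::mutex m_mutex;
    // One entry per inode *number*: the same inode in every subvol
    // shares a lock, as described above.
    std::map<uint64_t, std::deque<TaskFn>> m_waiters;
public:
    // Returns true if the caller now holds the lock; otherwise the
    // task is queued to run when the current holder releases it.
    bool lock_or_defer(uint64_t ino, TaskFn task)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        auto it = m_waiters.find(ino);
        if (it == m_waiters.end()) {
            m_waiters[ino];              // empty queue == locked
            return true;
        }
        it->second.push_back(task);      // FIFO: strict ordering
        return false;
    }
};
```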
Fixes: https://github.com/Zygo/bees/issues/158
Signed-off-by: Zygo Blaxell <bees@furryterror.org>