Toxic extents are mostly gone in kernel 5.7 and later. Increase the
timeout for toxic extent handling to reduce false positives, and remove
persistently stored toxic hashes from the hash table.
Toxic hashes are still stored nonpersistently to help mitigate problems
due to any remaining kernel bugs.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The serialization doesn't seem to be necessary for the extent scan mode.
No infinite loops in the kernel have been observed in the past two years,
despite never having used MultiLock for the extent scanner.
Leave the serialization for now on the subvol scanners.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need the subvol numbers since they're only interesting to
developers.
We don't need both max and min sizes; pick one and drop the other.
Replace "16E" with "max"--it is the same number of characters, but
doesn't require the user to know what 1<<64 is off the top of their head.
Shorten "remain" to "todo" because sometimes those extra two columns
matter.
Drop the seconds field in ETA timestamps. Long scan arrival times are
years away, and short scan arrival times are only updated once every
5 minutes, so the extra precision isn't useful.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the progress information more accessible, without having to
enable full debug log and fish it out of the stream with grep.
Also increase the progress log level to INFO.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are two crawl_maps in extent scan's next_transid: one gets
initialized, the other gets used. This works OK as long as bees is
resuming an existing scan, because the two maps are identical; however,
it fails if bees is starting without an existing set of crawl data,
and one of the two maps is empty or partially filled.
The failure is intermittent, as the crawl map is being populated at
the same time next_transid runs. It will eventually be completed after
several transaction cycles, at which point bees runs normally.
It does add significant delays during startup for benchmarks.
There's only one crawl_map in extent scan, it always has the same
crawlers, and extent scan's `next_transid` creates it by itself.
Ignore the map from BeesRoots/BeesCrawl.
Also throw in some missing but helpful trace statements.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Replace pointers in the "done" and "total" columns with estimated data
sizes for each size tier. The estimation is based on statistics
collected from extents scanned during the current bees run.
Move the total size for the entire filesystem up to the heading.
Report the _completed_ position (i.e. the one that would be saved in
`beescrawl.dat`), not the _queued_ position (i.e. the one where the
next Task would be created in memory).
At the end of the data, the crawl pointer ends up at some random point
in the filesystem just after the newest extent, so the progress gets to
99.7% and then goes to some random value like 47% or 3%, not to 100%.
Report "deferred" in the "done" column when the crawler is waiting for
the next transid, and "finished" in the "%done" column when the crawler
has reached the end of the data. Suppress the ETA when finished. This
makes it clear that there's no further work to do for these crawlers.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesScanModeExtent uses six scan Tasks instead of one, which leads
to awkwardness like the do_scan method to tell crawl_roots how to do
what it shouldn't need to know how to do anyway.
Move the crawl_roots logic into the ::scan methods themselves.
This also deletes the very popular "crawl_more ran out of data" message.
Extent scan explicitly indicates when a scan is complete, so there's
no longer a need to fish this message out of the log.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The sorting avoids problematic read orders, like extent refs in the same
inode with descending offsets, that btrfs is not optimized for.
Putting everything in one Task keeps the queue sizes small, and
manages the lock contention much more calmly.
We only want to be mapping extent refs if there aren't enough extents
already in the queue to keep worker threads busy, so use the `idle()`
method instead of `run()`.
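
A minimal sketch of the kind of ordering described above. The commit
message does not state the sort key, so the ascending (root, inode,
offset) key here is an assumption, and `SketchExtentRef` and
`sort_refs_for_reading` are hypothetical names, not bees types:

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical ref record; field names are illustrative only.
struct SketchExtentRef {
    uint64_t root;    // subvol tree id
    uint64_t inode;   // inode number within the subvol
    uint64_t offset;  // byte offset within the inode
};

// Assumed ordering: group refs by subvol and inode, then read each
// inode with ascending offsets, never descending ones.
inline void sort_refs_for_reading(std::vector<SketchExtentRef> &refs)
{
    std::sort(refs.begin(), refs.end(),
        [](const SketchExtentRef &a, const SketchExtentRef &b) {
            return std::tie(a.root, a.inode, a.offset)
                 < std::tie(b.root, b.inode, b.offset);
        });
}
```

With refs sorted this way, a single Task can walk the list front to
back and touch each inode in one forward pass.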
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The EXTENT scan mode reads the extent tree, splits it into tiers by
extent size, converts each tier's extents into subvol/inode/offset refs,
then runs the legacy bees dedupe engine on the refs.
The extent scan mode can cheaply compute completion percentage and ETA,
so do that every time a new transid is observed.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We were doing a `LOGICAL_INO` ioctl on every _block_ of a matching extent,
just to see how long it takes. It takes a while!
This could be modified to do an ioctl with the `IGNORE_OFFSET` flag,
once per new extent, but the kernel bug was fixed a long time ago, so
we can start removing all the toxic extent code.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When we have multiple possible matches for a block, we proceed in three
phases:
1. retrieve each match's extent refs and put them in a list,
2. iterate over the list converting viable block matches into range matches,
3. sort and flatten the list of range matches into a non-overlapping
list of ranges that cover all duplicate blocks exactly once.
The separation of phase 1 and 2 creates a performance issue when there
are many block matches in phase 1, and all the range matches in phase
2 are the same length. Even though we might quickly find the longest
possible matching range early in phase 2, we first extract all of the
extent refs from every possible matching block in phase 1, even though
most of those refs will never be used.
Fix this by moving the extent ref retrieval in phase 1 into a single
loop in phase 2, and stop looping over matching blocks as soon as any
dedupe range is created. This avoids iterating over a large list of
blocks with expensive `LOGICAL_INO` ioctls in an attempt to improve the
match when there is no hope of improvement, e.g. when all match ranges
are 4K and the content is extremely prevalent in the data.
If we find a matched block that is part of a short matching range,
we can replace it with a block that is part of a long matching range,
because there is a good chance we will find a matching hash block in
the long range by looking up hashes after the end of the short range.
In that case, overlapping dedupe ranges covering both blocks in the
target extent will be inserted into the dedupe list, and the longest
matches will be selected at phase 3. This usually provides a similar
result to that of the loop in phase 1, but _much_ more efficiently.
Some operations are left in phase 1, but they are all using internal
functions, not ioctls.
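
Phase 3 can be illustrated with a small sketch. This is not the bees
implementation: `SketchRange` and `flatten_longest_first` are
hypothetical names, ranges are counted in blocks within one target
extent, and "longest match wins" is implemented by letting longer
ranges claim blocks before shorter ones:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open block range [begin, end) inside one target extent.
struct SketchRange {
    uint32_t begin;
    uint32_t end;
    uint32_t length() const { return end - begin; }
};

// Flatten possibly-overlapping matches into non-overlapping ranges
// that cover every matched block exactly once.
std::vector<SketchRange> flatten_longest_first(std::vector<SketchRange> matches,
                                               uint32_t block_count)
{
    // Longest matches claim their blocks first.
    std::stable_sort(matches.begin(), matches.end(),
        [](const SketchRange &a, const SketchRange &b) {
            return a.length() > b.length();
        });

    const size_t unowned = SIZE_MAX;
    std::vector<size_t> owner(block_count, unowned);
    for (size_t i = 0; i < matches.size(); ++i) {
        const uint32_t end = std::min(matches[i].end, block_count);
        for (uint32_t b = matches[i].begin; b < end; ++b) {
            if (owner[b] == unowned) owner[b] = i;
        }
    }

    // Emit maximal runs of consecutive blocks with the same owner.
    std::vector<SketchRange> flat;
    uint32_t b = 0;
    while (b < block_count) {
        if (owner[b] == unowned) { ++b; continue; }
        uint32_t run_end = b + 1;
        while (run_end < block_count && owner[run_end] == owner[b]) ++run_end;
        flat.push_back({b, run_end});
        b = run_end;
    }
    return flat;
}
```

For example, matches [0,4), [2,10), and [8,12) flatten to [0,2),
[2,10), and [10,12): the 8-block match is kept whole and the two
shorter matches are trimmed around it.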
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A laundry list of problems fixed:
* Track which physical blocks have been read recently without making
any changes, and don't read them again.
* Separate dedupe, split, and hole-punching operations into distinct
planning and execution phases.
* Keep the longest dedupe from overlapping dedupe matches, and flatten
them into non-overlapping operations.
* Don't scan extents that have blocks already in the hash table.
We can't (yet) touch such an extent without making unreachable space.
Let them go.
* Give better information in the scan summary visualization: show dedupe
range start and end points (<ddd>), matching blocks (=), copy blocks
(+), zero blocks (0), inserted blocks (.), unresolved match blocks
(M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i),
and there's-a-bug-we-didn't-do-this-one blocks (#).
* Drop cached data from extents that have been inserted into the hash
table without modification.
* Rewrite the hole punching for uncompressed extents, which apparently
hasn't worked properly since the beginning.
Nuisance dedupe elimination:
* Don't do more than 100 dedupe, copy, or hole-punch operations per
extent ref.
* Don't split an extent or punch a hole unless dedupe would save at
least half of the extent ref's size.
* Write a "skip:" summary showing the planned work when nuisance
dedupe elimination decides to skip an extent.
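
The two nuisance rules can be sketched as a predicate over a planned
set of operations. Everything named here (`SketchPlan`,
`plan_is_nuisance`, `sketch_max_ops`) is hypothetical. The sketch
also assumes that the 100-operation limit applies to the combined
count, and that any copy operation implies an extent split:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical summary of the work planned for one extent ref.
struct SketchPlan {
    size_t dedupe_ops = 0;
    size_t copy_ops = 0;
    size_t punch_ops = 0;
    uint64_t bytes_saved = 0;       // space dedupe would free
    uint64_t extent_ref_bytes = 0;  // size of the extent ref
};

constexpr size_t sketch_max_ops = 100;

// Returns true when the plan should be skipped as a nuisance dedupe.
inline bool plan_is_nuisance(const SketchPlan &p)
{
    // Rule 1 (assumed to be a combined limit): too many operations.
    const size_t total = p.dedupe_ops + p.copy_ops + p.punch_ops;
    if (total > sketch_max_ops) return true;

    // Rule 2: splitting or hole punching must save at least half of
    // the extent ref. Compare 2 * saved against size to avoid division.
    const bool rewrites = p.copy_ops > 0 || p.punch_ops > 0;
    if (rewrites && p.bytes_saved * 2 < p.extent_ref_bytes) return true;

    return false;
}
```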
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Originally the limit was 2730 (64KiB worth of ref pointers). This limit
was a little too low for some common workloads, so it was then raised by
a factor of 256 to 699050, but there are a lot of problems with extent
counts that large. Most of those problems are memory usage and speed
problems, but some of them trigger subtle kernel MM issues.
699050 references is too many to be practical. Set the limit to 9999,
only 3-4x larger than the original 2730, to give up on deduplication
when each deduped ref reduces the amount of space by no more than 0.01%.
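
The numbers above can be reproduced with a little arithmetic. This
sketch assumes each ref occupies three 64-bit values (inode, offset,
root) after a 16-byte header in the ioctl result buffer, and that the
second limit corresponds to a 16 MiB buffer (256 times 64 KiB). It
also reads the 0.01% figure as 1/9999. The names are hypothetical:

```cpp
#include <cstdint>

constexpr uint64_t sketch_header_bytes = 16;   // assumed result header size
constexpr uint64_t sketch_bytes_per_ref = 24;  // assumed: 3 x u64 per ref

// Number of refs that fit in a result buffer of the given size.
constexpr uint64_t refs_that_fit(uint64_t buffer_bytes)
{
    return (buffer_bytes - sketch_header_bytes) / sketch_bytes_per_ref;
}

// Share of the total represented by one more ref, as a percentage.
constexpr double marginal_saving_percent(uint64_t ref_count)
{
    return 100.0 / static_cast<double>(ref_count);
}

static_assert(refs_that_fit(64 * 1024) == 2730, "64 KiB buffer");
static_assert(refs_that_fit(16 * 1024 * 1024) == 699050, "16 MiB buffer");
```

Under these assumptions, `marginal_saving_percent(9999)` is just over
0.01, and 9999 / 2730 is about 3.7, matching the "3-4x" above.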
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves a third bad problem with bees reads:
3. The architecture above the read operations will issue read requests
for the same physical blocks over and over in a short period of time.
Fixing that properly requires rewriting the upper-level code, but a
simple small table of recent read requests can reduce the effect of the
problem by orders of magnitude.
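
A minimal sketch of such a table, assuming it is keyed by physical
block address and bounded by both entry count and age. The class name
`SketchRecentReads`, its capacity, and its expiry policy are
assumptions for illustration, and the sketch is not thread-safe (a
real version would need a lock):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class SketchRecentReads {
public:
    using Clock = std::chrono::steady_clock;

private:
    struct Entry {
        uint64_t bytenr;
        Clock::time_point when;
    };

public:
    SketchRecentReads(size_t capacity, Clock::duration max_age)
        : m_capacity(capacity), m_max_age(max_age) {}

    // Returns true if the caller should issue the read (and records
    // it), or false if the same address was read recently.
    bool should_read(uint64_t bytenr, Clock::time_point now = Clock::now())
    {
        auto found = m_index.find(bytenr);
        if (found != m_index.end()) {
            if (now - found->second->when < m_max_age) return false;
            // Entry is stale: forget it and record a fresh read below.
            m_order.erase(found->second);
            m_index.erase(found);
        }
        m_order.push_front(Entry{bytenr, now});
        m_index[bytenr] = m_order.begin();
        if (m_order.size() > m_capacity) {
            // Table is full: evict the oldest recorded read.
            m_index.erase(m_order.back().bytenr);
            m_order.pop_back();
        }
        return true;
    }

private:
    size_t m_capacity;
    Clock::duration m_max_age;
    std::list<Entry> m_order;  // newest first
    std::unordered_map<uint64_t, std::list<Entry>::iterator> m_index;
};
```

A caller would check `should_read(bytenr)` before each read and skip
the read when it returns false.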
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves some of the worst problems with bees reads:
1. The kernel readahead doesn't work. More precisely, it's much better
adapted for a very different use case: a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead. The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads get
preempted by the random reads, or the kernel cancels the readaheads because
the data access pattern isn't sequential after the readahead was issued.
2. Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces. At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile. In all other cases,
the elaborate calculations always return the same result: there can be
only one reader at a time.
This commit fixes both problems:
1. Don't use the kernel readahead. Use normal reads into a dummy
buffer instead.
2. Allow only one thread to readahead at any time. Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.
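
The approach can be sketched as follows. This is an illustration
under stated assumptions, not the bees code: `sketch_readahead` and
`sketch_readahead_pair` are hypothetical stand-ins, the 128 KiB
scratch buffer size is arbitrary, and errors are treated as "stop
early" because the read is only a cache-warming hint:

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

// One process-wide lock: only one thread warms the page cache at a time.
static std::mutex g_sketch_readahead_mutex;

// Read [offset, offset + length) with ordinary reads and discard the
// data. The caller must hold g_sketch_readahead_mutex.
static size_t read_and_discard_locked(int fd, off_t offset, size_t length)
{
    static thread_local std::vector<char> scratch(128 * 1024);
    size_t total = 0;
    while (total < length) {
        const size_t want = std::min(scratch.size(), length - total);
        const ssize_t got = pread(fd, scratch.data(), want,
                                  offset + static_cast<off_t>(total));
        if (got <= 0) break;  // EOF or error: best effort only
        total += static_cast<size_t>(got);
    }
    return total;
}

// Warm the page cache for one range. Returns the number of bytes read.
size_t sketch_readahead(int fd, off_t offset, size_t length)
{
    std::lock_guard<std::mutex> lock(g_sketch_readahead_mutex);
    return read_and_discard_locked(fd, offset, length);
}

// Warm the page cache for two ranges under a single lock acquisition,
// so no other reader can slip in between them.
size_t sketch_readahead_pair(int fd1, off_t offset1, size_t length1,
                             int fd2, off_t offset2, size_t length2)
{
    std::lock_guard<std::mutex> lock(g_sketch_readahead_mutex);
    return read_and_discard_locked(fd1, offset1, length1)
         + read_and_discard_locked(fd2, offset2, length2);
}
```

Holding the lock for the whole read is the point of the design: the
sequential read finishes without competition, and later small reads
of the same range are served from the page cache.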
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table is read sequentially and from a single thread, so
the kernel's implementation of readahead is appropriate here.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit c3b664fea54cfd8ac25411cbdb9536e4f24b008e ("context: don't forget
to retry locked extents") removed the critical return that prevents a
Task from processing an extent that is locked.
Put the return back.
Fixes: c3b664fea54cfd8ac25411cbdb9536e4f24b008e ("context: don't forget to retry locked extents")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These were added to crucible all the way back in 2018 (1beb61fb78ba
"crucible: error: record location of exception in what() message")
but they're even more useful in the stack tracer in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent
references by definition, it follows that we should not allocate buffer
space for them when we perform the LOGICAL_INO ioctl.
There is some evidence (particularly
https://github.com/Zygo/bees/issues/260#issuecomment-1627598058) that
the kernel is subjecting the page cache to a lot of disruption when
trying to allocate large buffers for LOGICAL_INO.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There was a bug in kernel 6.3 where LOGICAL_INO with IGNORE_OFFSET
sometimes failed to ignore the offset. That bug is now fixed, but
LOGICAL_INO still returns 0 refs much more often than seems appropriate.
This is most likely because bees frequently deletes extents while there
is still work waiting for them in Task queues. In this case, LOGICAL_INO
correctly returns an empty list, because every reference to some extent
is deleted, but the new extent tree with that extent removed is not yet
committed in btrfs.
Add a DEBUG-level log message and an event counter to track these events.
In the absence of a kernel bug, the debug message may indicate CPU time
was wasted performing a search whose outcome could have been predicted.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extents are much less of a problem now than they were in kernels
before 5.7. Downgrade the log message level to reflect their lesser
importance.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We check the result of transid_max_nocache(), but not the result of
transid_max(). The latter is a computed result that is even more likely
to be wrong[citation needed].
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Each object contains a 16 MiB buffer, which is very heavy for some
malloc implementations.
Keep the objects in a Pool so that their buffers are only allocated and
deallocated once in the process lifetime.
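
A minimal sketch of the pooling idea, assuming a handle that returns
its object to the pool instead of freeing it. `SketchPool` and
`SketchBuffer` are hypothetical stand-ins, not the Pool class this
commit refers to, and the pool must outlive every handle it gives out:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for an object that owns a large buffer.
struct SketchBuffer {
    std::vector<char> bytes = std::vector<char>(16u * 1024u * 1024u);
};

template <class T>
class SketchPool {
public:
    using Handle = std::shared_ptr<T>;

    // Reuse an idle object if one exists, otherwise allocate a new one.
    Handle get()
    {
        std::unique_ptr<T> obj;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (!m_idle.empty()) {
                obj = std::move(m_idle.back());
                m_idle.pop_back();
            }
        }
        if (!obj) obj = std::make_unique<T>();
        // The deleter puts the object back in the pool instead of
        // destroying it, so its buffer is never reallocated.
        return Handle(obj.release(), [this](T *p) {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_idle.emplace_back(p);
        });
    }

    size_t idle_count()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_idle.size();
    }

private:
    std::mutex m_mutex;
    std::vector<std::unique_ptr<T>> m_idle;
};
```

Each `get()` after the first release hands back an already-allocated
object, so the large allocation happens once per pooled object rather
than once per use.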
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If the send workaround is enabled, it is possible for two threads (a
thread running the crawl_new task, and a thread attempting to apply the
send workaround) to access the same RootFetcher object at the same time.
That never ends well.
Give each function its own BtrfsRootFetcher object.
Fixes: https://github.com/Zygo/bees/issues/250
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With SIGTERM and fast exit, the trickle writeback is less important.
We don't want to flood people's IO subsystems with continuous writes.
This really should be configurable at runtime.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Do rebuild bees-version.cc if libcrucible changes.
Don't rebuild bees-version.cc if it doesn't change.
Also use the standard suffix for new files.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These tools are obsolete. fiemap was a thin wrapper around FIEMAP,
but FIEMAP is not useful on btrfs. fiewalk was a thin wrapper around
BtrfsExtentWalker, but development on BtrfsExtentWalker has been
abandoned.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When a hash table write fails, we skip over the write throttling because
we didn't report that we successfully wrote an extent. This can be bad
if the filesystem is full and the allocations for writes are burning a
lot of CPU time searching for free space.
We also don't retry the write later on since we assume the extent is
clean after a write attempt whether it was successful or not, so the
extent might not be written out later when writes are possible again.
Check whether a hash extent is dirty, and always throttle after
attempting the write.
If a write fails, leave the extent dirty so we attempt to write it out
the next time flush cycles through the hash table. During shutdown,
this will reattempt each failing write once; after that, the updated hash
table data will be dropped.
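
The corrected flush behaviour can be sketched as one pass over the
dirty flags. The function and callback names are hypothetical, and
the real hash table code is structured differently. The sketch shows
only the two rules: clear the dirty flag on success only, and
throttle after every attempt:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Makes one flush pass. Returns the number of extents still dirty.
size_t sketch_flush_pass(std::vector<bool> &dirty,
                         const std::function<bool(size_t)> &write_extent,
                         const std::function<void()> &throttle)
{
    size_t still_dirty = 0;
    for (size_t i = 0; i < dirty.size(); ++i) {
        if (!dirty[i]) continue;  // clean extents are skipped entirely
        bool ok = false;
        try {
            ok = write_extent(i);
        } catch (...) {
            ok = false;  // a thrown error counts as a failed write
        }
        if (ok) {
            dirty[i] = false;
        } else {
            ++still_dirty;  // stays dirty, retried on the next pass
        }
        throttle();  // always throttle, even when the write failed
    }
    return still_dirty;
}
```

Calling the function again later retries only the extents whose
writes failed, which matches the retry-on-next-flush behaviour
described above.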
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Calling 'bees -m4' should not call 'std::terminate()', but it does.
Use catch_all instead. It will still pass the exit value to return
from main.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESTOOLONG was always reporting a size of zero, and the offset of the
end of the readahead region. Report the original size instead (and also
in BEESTRACE and BEESNOTE).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Drop the crawl_restart counter, it doesn't happen here (or anywhere else).
Add the crawl_again counter for extents that are restarted due to an
extent-level lock.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
libcrucible can deal with the Linux kernel and/or libc's thread name
limitations. No need to duplicate that work in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The caller of scan_forward has to stop advancing the BeesFileCrawl
position when an extent lock blocks a scan, so that it will resume
from the same position when the Task is scheduled again; otherwise,
bees simply skips over the extent and leaves it incompletely deduped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restart crawl_more (and update crawl roots and flush FD caches) every
time the transid changes, and only when the transid changes, but
not more often than a reasonable minimum poll interval.
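
The restart condition can be sketched as a small state machine. The
class and method names are hypothetical, and the caller supplies the
current transid and time. A restart is signalled only when the
transid differs from the last one acted on and the minimum interval
has elapsed since the previous restart:

```cpp
#include <chrono>
#include <cstdint>

class SketchTransidPoller {
public:
    using Clock = std::chrono::steady_clock;

    explicit SketchTransidPoller(Clock::duration min_interval)
        : m_min_interval(min_interval) {}

    // Returns true when the crawl should be restarted now.
    bool should_restart(uint64_t current_transid, Clock::time_point now)
    {
        // Only when the transid changes...
        if (current_transid == m_last_transid) return false;
        // ...but not more often than the minimum poll interval.
        if (m_restarted_once && now - m_last_restart < m_min_interval) {
            return false;
        }
        m_last_transid = current_transid;
        m_last_restart = now;
        m_restarted_once = true;
        return true;
    }

private:
    Clock::duration m_min_interval;
    uint64_t m_last_transid = 0;
    Clock::time_point m_last_restart{};
    bool m_restarted_once = false;
};
```

A transid change that arrives too soon is not lost: the next poll
after the interval expires still sees a transid different from the
last one acted on, and restarts then.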
Clean up the log message: use the proper thread name and remove
the wildly inaccurate estimate of when crawl will resume.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need to cache 65536 extent maps, especially if each one
can have almost 700K references.
Valgrind's massif tool points to the extent map cache as a very
large memory allocator, but test runs with memcg disagree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we have loadavg targeting enabled, there may be no worker threads
available to respond to new subvols, so we should not bother updating
the subvols list.
Put insert_new_crawl into a Task so it only executes when a worker
is available.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On large filesystems where the min_transid of all subvols gets stuck at 0,
bees may lose the ability to effectively track recent data. A secondary sort
by max_transid will allow scanning newer subvols that were created after bees
started running on the filesystem, but before bees completed the first scan
of all subvols.
On the other hand, the secondary sort does a reverse version of the
sequential scan mode, and the sequential scan mode is simply awful.
Disable it for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>