The cwd is where core dumps and various profiling and verification
libraries want to write their data, whereas root_fd is the root of the
target filesystem. These are often intentionally different. When
they are different, `--strip-paths` sets the wrong prefix to strip
from paths.
Once the root fd has been established, we can set the path prefix to
the string prefix that we'll get from future calls to `name_fd`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees explicitly supports storing $BEESHOME on another filesystem, and
does not require that filesystem to be btrfs; however, if $BEESHOME
is on a non-btrfs filesystem, there is an exception on every startup
when trying to identify the subvol root of the hash table file in order
to blacklist it, because non-btrfs filesystems don't have subvol roots.
Fix by checking not only whether $BEESHOME is on btrfs, but whether it
is on the _same_ btrfs, as the bees root, without throwing an exception.
The hash table is blacklisted only when both filesystems are btrfs and
have the same fsid.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Remove dubious comments and #if 0 section. Document new event counters,
and add one for read failures.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
1% is a lot of data on a petabyte filesystem, and a long time to wait for an
ETA.
After 1 GiB we should have some idea of how fast we're reading the data.
Increase the time to 10 seconds to avoid a nonsense result just after a scan
starts.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There's a pathological case where all of the extent scan crawlers except
one are at the end of a crawl cycle, but the one crawler that is still
running is keeping the Task queue full. The result is that bees never
starts the other extent scan crawlers, because the queue is always
full at the instant a new transid triggers the start of a new scan.
That's bad because it will result in bees falling behind when new data
from the inactive size tiers appears.
To fix this, check for throttling _after_ creating at least one scan task
in each crawler. That will keep the crawlers running, and possibly allow
them to claw back some space in the Task queue. It slightly overcommits
the Task queue, so there will be a few more Tasks than nominally allowed.
Also (re)introduce some hysteresis in the queue size limit and reduce it
a little, so that bees isn't continually stopping and restarting crawls
every time one task is created or completed, and so that we stay under
the configured Task limit despite overcommitting.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extent workarounds are going away because the underlying kernel
bugs have been fixed. They are no longer worthy of spamming non-developer
logs.
INO_PATHS can return no paths if an inode has been deleted. It doesn't
need a log message at all, much less one at WARN level.
Dedupe failure can be INFO, the same level as dedupe itself, especially
since the "NO dedupe" message doesn't mention what was [not] deduped.
Inspired by Kai Krakow's "context: demote "abandoned toxic match" to
debug log level".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This log message creates a overwhelmingly lot of messages in the system
journal, leading to write-back flushing storms under high activity. As
it is a work-around message, it is probably only useful to developers,
thus demote to debug level.
This fixes latency spikes in desktop usage after adding a lot of new
files, especially since systemd-journal starts to flush caches if it
sees memory pressure.
Signed-off-by: Kai Krakow <kai@kaishome.de>
next_transid tasks don't respect queue selection very well, because
they effectively end up spinning in a loop until all other worker
threads become busy.
Back this out, and fix the priority handling in the Task library.
This reverts commit 58db4071de5f524c35b1362bfb5b1fceedea503f.
This records the time when the progress data was calculated, to help
indicate when the data might be very old.
While we're here, move "now" out of the loop so there's only one value.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This increases resistance to symlink and mount attacks.
Previously, bees could follow a symlink or a mount point in a directory
component of a subvol or file name. Once the file is opened, the open
file descriptor would be checked to see if its subvol and inode matches
the expected file in the target filesystem. Files that fail to match
would be immediately closed.
With openat2 resolve flags, symlinks and mount points terminate path
resolution in the kernel. Paths that lead through symlinks or onto
mount points cannot be opened at all.
Fall back to openat() if openat2() returns ENOSYS, so bees will still
run on kernels before v5.6.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we're now using weak symbols for dodgy libc functions, we might
as well do it for gettid() too.
Use the ::gettid() global namespace and let libc override it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"ctime", an abbreviation of "cycle time", collides with "ctime", an
abbreviation of "st_ctime", a well-known filesystem term.
"tm_left" fits in the column, so use that.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* Report position within cycle in units that cannot be mistaken for size or percentage
* Put the total/maximum values in their own row
* Add a start time column
* Change column titles to reference "cycles"
* Use "idle" instead of "finished" when a crawler is not running
* Replace "transid" with "gen" because it's shorter
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The scanners which finish early can become stuck behind scanners that are
able to keep the queue full. Switch the next_transid task to the normal
Task queues so that we force scanners to restart on every new transaction,
possibly deferring already queued work to do so.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add yet another field to the scan/skip report line: the wallclock
time used to process the extent ref.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The total data size should not include metadata or system block groups,
and already does not; however, we still have these block groups in the map
for mapping the crawl pointer to a logical offset within the filesystem.
Rearrange a few lines around the `if` statement so that the map doesn't
contain anything it should not.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The progress indicator was failing on a mixed-bg filesystem because those
filesystems have block groups which have both _DATA and _METADATA bits,
and the filesystem size calculation was excluding block groups that have
_METADATA set. It should exclude block groups that have _DATA not set.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Running bees with no arguments complains about "Only one" path argument.
Replace this with "Exactly one" which uses similar terminology to other
btrfs tools.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
`getopt_long` already supplies a message when an option cannot be parsed,
so there isn't a need to distinguish option parse failures from help
requests.
Fixes: https://github.com/Zygo/bees/pull/277
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Longer latency testing runs are not showing a consistent gain from a
throttle factor of 1.0. Make the default more conservative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Decaying averages by 10% every 5 minutes gives roughly a half-hour
half-life to the rolling average. Speed that up to once per minute.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We're not adding any more short options, but the debugging code doesn't
work with optvals above 255. Also clean up constness and variable
lifetimes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Measure the time spent running various operations that extend btrfs
transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe)
and arrange for each operation to run for not less than the average
amount of time by adding a sleep after each operation that takes less
than the average.
The delay after each operation is intended to slow down the rate of
deferred and long-running requests from bees to match the rate at which
btrfs is actually completing them. This may help avoid big spikes in
latency if btrfs has so many requests queued that it has to force a
commit to release memory.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Test machines keep blowing past the 32k file limit. 16 worker
threads at 10,000 files each is much larger than 32k.
Other high-FD-count services like DNS servers ask for million-file
rlimits.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
While a snapshot is being deleted, there will be a continuous stream of
"No ref for extent" messages. This is a common event that does not need
to be reported.
There is an analogous situation when a call to open() fails with ENOENT.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Dedupe is not possible on a subvol where a btrfs send is running:
BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress)
btrfs informs a process with EAGAIN that a dedupe could not be performed
due to a running send operation.
It would be possible to save the crawler state at the affected point,
fork a new crawler that avoids the subvol under send, and resume the
crawler state after a successful dedupe is detected; however, this only
helps the intersection of the set of users who have unrelated subvols
that don't share extents, and the set of users who cannot simply delay
dedupe until send is finished. The simplest approach is to simply stop
and wait until the send goes away.
The simplest approach is taken here. When a dedupe fails with EAGAIN,
affected Tasks will poll, approximately once per transaction, until the
dedupe succeeds or fails with a different error.
bees dedupe performance corresponds with the availability of subvols that
can accept dedupe requests. While the dedupe is paused, no new Tasks can
be performed by the worker thread. If subvols are small and isolated
from the bulk of the filesystem data, the result will be a small but
partial loss of dedupe performance during the send as some worker threads
get stuck on the sending subvol. If subvols heavily share extents with
duplicate data in other subvols, worker threads will all become blocked,
and the entire bees process will pause until at least some of the running
sends terminate.
During the polling for btrfs send, the dedupe Task will hold its dst
file open. This open FD won't interfere with snapshot or file delete
because send subvols are always read-only (it is not possible to delete
a file on a RO subvol, open or otherwise) and send itself holds the
affected subvol open, preventing its deletion. Once the send terminates,
the dedupe will terminate soon after, and the normal FD release can occur.
This pausing during btrfs send is unrelated to the
`--workaround-btrfs-send` option, although `--workaround-btrfs-send` will
cause the pausing to trigger less often. It applies to all scan modes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are no callers of this method any more, and it exposes more
of BeesRoots than we really want things to have access to.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
All callers of the `transid_max_nocache` method update `m_transid_re`
with the return value, so do that in `transid_max_nocache` itself.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The "done" pointer and the "%done" fields are still useful because they
indicate _actual_ progress, not the work that has been _promised_.
So it is possible for a crawl to be "finished" (all extents queued)
but not "100.0000%" (some of those extents still active or in the queue).
"deferred" state isn't particularly useful, so drop it.
"finished" state implies no ETA, so that column is unused.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
ETA is calculated using a sample obtained by snooping on bees's normal
crawling operations.
This sample is heavily biased and not representative of the entire
filesystem. If the distribution of extent sizes in the filesystem is
not uniform, the ETA can be wildly wrong.
Collecting an accurate sample set would require extra IO and CPU time
which should be spent doing dedupes instead.
Explicitly label the ETA as inaccurate to avoid having too many users
report the same bug.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These are simple on/off switches for the task queue. They are lightweight
requests for bees to be paused temporarily, but allow bees to release
open files and save progress while paused.
These signals are an alternative to SIGSTOP and SIGCONT, or using the
cgroup freezer's FROZEN and THAWED states, which pause and resume the
bees process, but do not allow the bees process to release open files
or save progress. Snapshot and file deletes can occur on the filesystem
while bees is paused by SIGUSR1 but not by SIGSTOP.
These signals are also an alternative to SIGTERM and restart, which
flush out the whole hash table and progress state on exit, and read
the whole table back into memory on restart.
This feature is experimental and may be replaced by a more general
configuration or runtime control mechanism in the future.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In some cases the offset and size arguments were flipped when checking to
see if a range had already been read. This would have been OK as long as
the same mistake had been made consistently, since `bees_readahead_check`
only does a cache lookup on the parameters, it doesn't try to use them to
read a file. Alas, there was one case where the correct order was used,
albeit a relatively rare one.
Fix all the calls to use the correct order.
Also fix a comment: the recent request cache is global to all threads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Although all the members of BtrfsExtentDataFetcher are theoretically
copiable, there's no need to actually make any such copy.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make it clearer where the progress information goes.
Also add placeholder text so the progress section isn't empty at startup,
when the progress hasn't been calculated yet.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extents are mostly gone in kernel 5.7 and later. Increase the
timeout for toxic extent handling to reduce false positives, and remove
persistenly stored toxic hashes from the hash table.
Toxic hashes are still stored nonpersistently to help mitigate problems
due to any remaining kernel bugs.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The serialization doesn't seem to be necessary for the extent scan mode.
No infinite loops in the kernel have been observed in the past two years,
despite never having used MultiLock for the extent scanner.
Leave the serialization for now on the subvol scanners.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need the subvol numbers since they're only interesting to
developers.
We don't need both max and min sizes, pick one and drop the other.
Replace "16E" with "max"--it is the same number of characters, but
doesn't require the user to know what 1<<64 is off the top of their head.
Shorten "remain" to "todo" because sometimes those extra two columns
matter.
Drop the seconds field in ETA timestamps. Long scan arrival times are
years away, and short scan arrival times are only updated once every
5 minutes, so the extra precision isn't useful.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the progress information more accessible, without having to
enable full debug log and fish it out of the stream with grep.
Also increase the progress log level to INFO.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are two crawl_maps in extent scan's next_transid: one gets
initialized, the other gets used. This works OK as long as bees is
resuming an existing scan, because the two maps are identical; however,
but it fails if bees is starting without an existing set of crawl data,
and one of the two maps is empty or partially filled.
The failure is intermittent, as the crawl map is being populated at
the same time next_transid runs. It will eventually be completed after
several transaction cycles, at which point bees runs normally.
It does add significant delays during startup for benchmarks.
There's only one crawl_map in extent scan, it always has the same
crawlers, and extent scan's `next_transid` creates it by itself.
Ignore the map from BeesRoots/BeesCrawl.
Also throw in some missing but helpful trace statements.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Replace pointers in the "done" and "total" columns with estimated data
sizes for each size tier. The estimation is based on statistics
collected from extents scanned during the current bees run.
Move the total size for the entire filesystem up to the heading.
Report the _completed_ position (i.e. the one that would be saved in
`beescrawl.dat`), not the _queued_ position (i.e. the one where the
next Task would be created in memory).
At the end of the data, the crawl pointer ends up at some random point
in the filesystem just after the newest extent, so the progress gets to
99.7% and then goes to some random value like 47% or 3%, not to 100%.
Report "deferred" in the "done" column when the crawler is waiting for
the next transid, and "finished" in the "%done" column when the crawler
has reached the end of the data. Suppress the ETA when finished. This
makes it clear that there's no further work to do for these crawlers.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesScanModeExtent uses six scan Tasks instead of one, which leads
to awkwardness like the do_scan method to tell crawl_roots how to do
what it shouldn't need to know how to do anyway.
Move the crawl_roots logic into the ::scan methods themselves.
This also deletes the very popular "crawl_more ran out of data" message.
Extent scan explicitly indicates when a scan is complete, so there's
no longer a need to fish this message out of the log.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The sorting avoids problematic read orders, like extent refs in the same
inode with descending offsets, that btrfs is not optimized for.
Putting everything in one Task keeps the queue sizes small, and
manages the lock contention much more calmly.
We only want to be mapping extent refs if there's not enough extents
already in the queue to keep worker threads busy, so use the `idle()`
method instead of `run()`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The EXTENT scan mode reads the extent tree, splits it into tiers by
extent size, converts each tiers's extents into subvol/inode/offset refs,
then runs the legacy bees dedupe engine on the refs.
The extent scan mode can cheaply compute completion percentage and ETA,
so do that every time a new transid is observed.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>