1
0
mirror of https://github.com/Zygo/bees.git synced 2025-05-17 21:35:45 +02:00

802 Commits

Author SHA1 Message Date
Steven Allen
a844024395
Make the runtime directory private
The status file contains sensitive information like filenames and duplicate chunk ranges. It might also make sense to set the process-wide `UMask=`, but that may have other unintended side effects.
2025-03-26 15:02:42 +00:00
Zygo Blaxell
47243aef14 hash: handle $BEESHOME on btrfs too
The `_nothrow` variants of `do_ioctl` return true when they succeed,
which is the opposite of what `ioctl` does.

Fix the logic so bees can correctly identify its own hash table when
it's on the same filesystem as the target.

Fixes: f6908420adcf3fbafef8bbde8a25f8bac65424b6 ("hash: handle $BEESHOME on non-btrfs")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-17 21:18:08 -05:00
Zygo Blaxell
a670aa5a71 extent scan: don't divide by zero if there were no loops
Commit 183b6a5361e040d7cc70c3b3391f43e0d126bc33 ("extent scan: refactor
BeesCrawl, BeesScanMode*") moved some statistics calculations out of
the loop in `find_next_extent`, but did not ensure that the statistics
would not be calculated if the loop had not executed any iterations.

In rare instances, the function returns without entering the loop at all,
which results in divide by zero.  Add a check just before doing that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
51b3bcdbe4 trace: deprecate BEESLOGTRACE, align trace logs with exception notices
Exceptions were logged at level NOTICE while the stack traces were logged
at level DEBUG.  That produced useless noise in the output with `-v5`
or `-v6`, where there were exception headings logged, but no details.

Fix that by placing the exceptions and traces at level DEBUG, but prefix
them with `TRACE:` for easy grepping.

Most of the events associated with BEESLOGTRACE either never happen,
or they are harmless (e.g. trying to open deleted files or subvols).
Reassign them to ordinary BEESLOGDEBUG, with one exception for
unrecognized Extent flags that should be debugged if any appear.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
ae58401d53 trace: avoid one copy in every trace function
While investigating https://github.com/Zygo/bees/issues/282 I noticed that
we're doing at least one unnecessary extra copy of the functor in BEESTRACE.
Get rid of it with a const reference.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
3e7eb43b51 BeesStringFile: figure out when to call--or _not_ call--fsync
Older kernel versions featured some bugs in btrfs `fsync`, which could
leave behind "ghost dirents", orphan filename items that did not have
a corresponding inode.  These dirents were created during log replay
during the first mount after a crash due to several different bugs in
the log tree and its use over the years.  The last known bug of this
kind was fixed in kernel 5.16.  As of this writing, no fixes for this
bug have been backported to any earlier LTS kernel.

Some filesystems, including btrfs, will flush the contents of a new
file before renaming it over an old file.  On paper, btrfs can do this
very cheaply since the contents of the new file are not referenced, and
the old file not dereferenced, until a tree commit which includes both
actions atomically; however, in real life, btrfs provides `fsync`-like
semantics and uses the log-tree infrastructure to implement them, which
compromises performance and acts as a magnet for bugs.

The benefit of this trade-off is that `rename` can be used as a
synchronization point for data outside of the btrfs, which would not
happen if everything `rename` does was simply deferred to the next
tree commit.  The cost of this trade-off is that for the first 8 years
of its existence, bees would trigger the bug so often that the project
recommended its users put $BEESHOME in its own subvol to make it easy
to remove ghost dirents left behind by the bug.

Some other filesystems, such as xfs, don't have any special semantics for
`rename`, and require `fsync` to avoid garbage or missing data after
a crash.  Even filesystems which do have a special case for `rename`
can be configured to turn it off.

btrfs will silently delete data from files in the event that an
unrecoverable data block write error occurs.  Kernel version 6.2 adds
important new and unexpected cases where this can happen on filesystems
using raid56 data, but it also happens in all usable btrfs versions
(the silent deletion behavior was introduced in kernel version 3.9).

Unrecoverable write errors are currently reported to userspace only
through `fsync`.  Since the failed extents are deleted, they cannot be
detected via csum failures or scrub after the fact--and it's too late
by then, the data is already gone.  `fsync` is the last opportunity
to detect the write failure before the `rename`.  If the error is not
detected, the contents of the file will be silently discarded in btrfs.
The impact on bees is that scans will abruptly restart from zero after
a crash combined with some other reasonably common failures.

Putting all of this together leads to a rather complex workaround:
if the filesystem under $BEESHOME (specifically, the filesystem where
BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs
filesystem, and the host kernel is a version prior to 5.16, then don't
call `fsync` before `rename`.  In all other cases, do call `fsync`,
and prevent dependent writes (i.e. the following `rename`) in the event
of errors.

Since present kernel versions still require `fsync`, we don't need
an upper bound on the kernel version check until someone fixes btrfs
`rename` (or perhaps adds a flag to `renameat2` which prevents use of
the log tree) in the kernel.  Once that fix happens, we can drop the
`fsync` call for kernels after that fixed version.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:04:20 -05:00
Zygo Blaxell
962d94567c hexdump: fix pointer cast const mismatch
Another hit from the exotic compiler collection:  build fails on GCC 9,
from Ubuntu 20...but not later versions of GCC.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:00:31 -05:00
Zygo Blaxell
6dbef5f27b fs: improve compatibility with linux-libc-dev 5.4
Fix the missing symbols that popped up when adding chunk tree to
lib/fs.cc.  Also define the missing symbols instead of merely trying to
avoid them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-08 21:17:15 -05:00
Zygo Blaxell
88b1e4ca6e main: unconditionally enable workaround for the logical_ino-vs-clone kernel bug
This obviously doesn't fix or prevent the kernel bug, but it does prevent
bees from triggering the bug without assitance from another application.

The bug can still be triggered by running bees at the same time as an
application which uses clone or LOGICAL_INO.  `btdu` uses LOGICAL_INO,
while `cp` from coreutils (and many others) use clone (reflink copy).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
v0.11-rc4
2025-02-06 23:14:16 -05:00
Zygo Blaxell
c1d7fa13a5 roots: drop unnecessary mutex unlock in stop_request
In commit 31b2aa3c0dcf042559fa495b63a1bf22bac4d55d ("context: speed
up orderly process termination"), the stop request was split into two
methods after the mutex unlock.

Now that there's nothing after the mutex unlock in `stop_request`,
there's no need for an explicit unlock to do what the destructor would
have done anyway.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
aa39bddb2d extent scan: implement an experimental ordered scan mode
Parallel scan runs each extent size tier in a separate thread.  The
threads compete to process extents within the tier's size range.

Ordered scan processes each extent size tier completely before moving on
to the next.  In theory, this means large extents always get processed
quickly, especially when new ones appear, and the queue does not fill up
with small extents.

In practice, the multi-threaded scanner massively outperforms the
single-threaded scanner, unless the number of worker threads is very
small (i.e. one).

Disable most of the feature for now, but leave the code in place so it
can be easily reactivated for future testing.

Ordered scan introduces a parallelized extent mapper Task.  Keep that in
parallel scan mode, which further enhances the parallelism.  The extent
scan crawl threads now run at 'idle' priority while the map tasks run
at normal priority, so the map tasks don't flood the task queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
1aea2d2f96 crawl: deprecate use of BeesCrawl to search the extent tree
BeesScanModeExtent can do that by itself now.  Overloading the subvol
crawl code resulted in an ugly, inefficient hack, and we definitely
don't want to accidentally continue to use it.

Remove the support for reading the extent tree and add some `assert`s
to make sure it isn't still used somewhere.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
673b450671 docs: update event counters after extent scan refactoring and crawl skipping
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
183b6a5361 extent scan: refactor BeesCrawl, BeesScanMode*
The main gains here are:

* Move extent tree searches into BeesScanModeExtent so that they are
not slowed down by the BeesCrawl code, which was designed for the
much more specialized metadata in subvol trees.
* Enable short extent skipping now that BeesCrawl is out of the way.
* Stop enumerating btrfs subvols when in extent scan mode.

All this gets rid of >99% of unnecessary extent tree searches.
Incremental extent scan cycles now finish in milliseconds instead
of minutes.

BeesCrawl was never designed to cope with the structure and content of
the extent tree.  It would waste thousands of tree-search ioctl calls
reading and ignoring metadata items.

Performance was particularly bad when a binary search was involved, as any
binary search probe that landed in a metadata block group would read and
discard all the metadata items in the block group, sequentially, repeated
for each level of the binary search.  This was blocking implementation of
short extent skipping optimization for large extent size tiers, because
the skips were using thousands of tree searches to skip over only a few
hundred extent items.

Extent scan also had to read every extent item twice to do the
transid filtering, because BeesCrawl's interface discarded the relevant
information when it converted a `BtrfsTreeItem` into a `BeesFileRange`.
The cost of this extra fetch was negligible, but it could have been zero.

Fix this by:

* Copy the equivalent of `fetch_extents` from BeesCrawl into
`BeesScanModeExtent`, then give each of the extent scan crawlers its
own `BtrfsDataExtentTreeFetcher` instance.  This enables extent tree
searches to avoid pure (non-mixed) metadata block groups.  `BeesCrawl`
is now used only for its interface to `BeesRoots` for saving state in
`beescrawl.dat`, and never to determine the next extent tree item.

* Move subvol-specific parts of `BeesRoots` into a new class
`BeesScanModeSubvol` so that `BtrfsScanModeExtent` doesn't have to enable
or support them.  In particular, `bees -m4` no longer enumerates all
of the _subvol_ crawlers.  `BeesRoots` is still used to save and load
crawl state.

* Move several members from `BtrfsScanModeExtent` into a per-crawler
state object `SizeTier` to eliminate the need for some locks and to
maintain separate cache state for `BtrfsDataExtentTreeFetcher`.

* Reuse the `BtrfsTreeItem` to get the generation field for the transid
range filter.

* Avoid a few corner cases when handling errors, where extent scan might
drop an extent without scanning it, or fail to advance to the next extent.

* Enable the extent-skipping algorithm for large size tiers, now that
`BeesCrawl::fetch_extents` is no longer slowing it down.

* Add a debug stream interface which developers can easily turn on when
needed to inspect the decisions that extent scan is making.

* Track metrics that are more useful, particularly searches per extent
scanned, and fraction of extents that are skipped.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
b6446d7316 roots: rework open_root_nocache to use btrfs-tree
This gets rid of one open-coded btrfs tree search.

Also reduce the log noise level for subvol open failures, and remove
some ancient references to `BEESLOG`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
d32f31f411 btrfs-tree: harden rlower_bound against exceptional objects
Rearrange the logic in `rlower_bound` so it can cope with a tree
that contains mostly block-aligned objects, with a few exceptions
filtered out by `hdr_stop`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
dd08f6379f btrfs-tree: add a method to get root backref items to BtrfsRootFetcher
This complements the already existing support for reading the fields of
a root backref.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
58ee297cde btrfs-tree: connect methods to the debug stream interface
In some cases functions already had existing debug stream support
which can be redirected to the new interface.  In other cases, new
debug messages are added.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
a3c0ba0d69 fs: add a runtime debug stream for btrfs tree searches
This allows plugging in an ostream at run time so that we can audit all
the search calls we are doing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
75040789c6 btrfs-tree: drop BtrfsFsTreeFetcher and clean up class comments
BtrfsFsTreeFetcher was used for early versions of the extent scanner, but
neither subvol nor extent scan now needs an object that is both persistent
and configured to access only one subvol.  BtrfsExtentDataFetcher does
the same thing in that case.

Clarify the comments on what the remaining classes do, so that
BtrfsFsTreeFetcher doesn't get inadvertently reinvented in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f9a697518d btrfs-tree: introduce BtrfsDataExtentTreeFetcher to read data extents without metadata
Binary searches can be extremely slow if the target bytenr is near a
metadata block group, because metadata items are not visible to the
binary search algorithm.  In a non-mixed-bg filesystem, there can be
hundreds of thousands of metadata items between data extent items, and
since the binary search algorithm can't see them, it will run searches
that iterate over hundreds of thousands of objects about a dozen times.

This is less of a problem for mixed-bg filesystems because the data and
metadata blocks are not isolated from each other.  The binary search
algorithm still can't see the metadata items, but there are usually
some data items close by to prevent the linear item filter from running
too long.

Introduce a new fetcher class (all the good names were taken) that tracks
where the end of the current block group is.  When the end of the current
block group is reached in the linear search, skip ahead to a block group
that can contain data items.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
c4ba6ec269 fs: add a ntoa function for chunk types
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
440740201a main: the base directory for --strip-paths should be root_fd, not cwd
The cwd is where core dumps and various profiling and verification
libraries want to write their data, whereas root_fd is the root of the
target filesystem.  These are often intentionally different.  When
they are different, `--strip-paths` sets the wrong prefix to strip
from paths.

Once the root fd has been established, we can set the path prefix to
the string prefix that we'll get from future calls to `name_fd`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f6908420ad hash: handle $BEESHOME on non-btrfs
bees explicitly supports storing $BEESHOME on another filesystem, and
does not require that filesystem to be btrfs; however, if $BEESHOME
is on a non-btrfs filesystem, there is an exception on every startup
when trying to identify the subvol root of the hash table file in order
to blacklist it, because non-btrfs filesystems don't have subvol roots.

Fix by checking not only whether $BEESHOME is on btrfs, but whether it
is on the _same_ btrfs, as the bees root, without throwing an exception.
The hash table is blacklisted only when both filesystems are btrfs and
have the same fsid.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
925b12823e fs: add do_ioctl_nothrow and fsid methods to btrfs fs info
Enable use of the ioctl to probe whether two fds refer to the same btrfs,
without throwing an exception.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
561e604edc seeker: turn off debug logging
The debug log is only revealed when something goes wrong, but it is
created and discarded every time `seek_backward` is called, and it
is quite CPU-intensive.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
30cd375d03 readahead: clean up the code, update docs
Remove dubious comments and #if 0 section.  Document new event counters,
and add one for read failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
48b7fbda9c progress: adjust minimum thresholds for ETA to 10 seconds and 1 GiB of data
1% is a lot of data on a petabyte filesystem, and a long time to wait for an
ETA.

After 1 GiB we should have some idea of how fast we're reading the data.
Increase the time to 10 seconds to avoid a nonsense result just after a scan
starts.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
85aba7b695 openat2: #include <linux/types.h> so we can know __u64
Alternative implementations could use `uint64_t` instead, from `cstdint`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 17:02:19 -05:00
Zygo Blaxell
de38b46dd8 scripts/beesd: harden the mount options
* `nodev`: This reduces rename attack surface by preventing bees from
 opening any device file on the target filesystem.

 * `noexec`: This prevents access to the mount point from being leveraged
 to execute setuid binaries, or execute anything at all through the
 mount point.

These options are not required because they duplicate features in the
bees binary (assuming that the mount namespace remains private):

 * `noatime`: bees always opens every file with `O_NOATIME`, making
 this option redundant.

 * `nosymfollow`: bees uses `openat2` on kernels 5.6 and later with
 flags that prevent symlink attacks.  `nosymfollow` was introduced in
 kernel 5.10, so every kernel that can do `nosymfollow` can already do
 `openat2`.  Also, historically, `$BEESHOME` can be a relative path with
 symlinks in any path component except the last one, and `nosymfollow`
 doesn't allow that.

Between `openat2` and `nodev`, all symlink attacks are prevented, and
rename attacks cannot be used to force bees to open a device file.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 01:00:41 -05:00
Zygo Blaxell
0abf6ebb3d scripts/beesd: no need for $BEESHOME to be a subvol
We _recommend_ that `$BEESHOME` should be a subvol, and we'll create a
subvol if no directory exists; however, there's no reason to reject an
existing plain directory if the user chooses to use one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 00:43:13 -05:00
Kai Krakow
360ce7e125 scripts/beesd: Unshare namespace without systemd
If starting the beesd script without systemd, the mount point won't
automatically unmount if the script is cancelled with ctrl+c.

Fixes: https://github.com/Zygo/bees/issues/281
Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-20 00:05:57 -05:00
Zygo Blaxell
ad11db2ee1 openat2: supply the missing definitions for building with old headers and new kernel
Apparently Ubuntu 20 has upgraded to kernel 5.15, but still builds things
with 5.4 headers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:20:06 -05:00
Zygo Blaxell
874832dc58 openat2: log a warning when we fall back to openat
This should occur only once per run, but it's worth leaving a note
that it has happened.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
Zygo Blaxell
5fe89d85c3 extent scan: make sure we run every extent crawler once per transaction
There's a pathological case where all of the extent scan crawlers except
one are at the end of a crawl cycle, but the one crawler that is still
running is keeping the Task queue full.  The result is that bees never
starts the other extent scan crawlers, because the queue is always
full at the instant a new transid triggers the start of a new scan.
That's bad because it will result in bees falling behind when new data
from the inactive size tiers appears.

To fix this, check for throttling _after_ creating at least one scan task
in each crawler.  That will keep the crawlers running, and possibly allow
them to claw back some space in the Task queue.  It slightly overcommits
the Task queue, so there will be a few more Tasks than nominally allowed.

Also (re)introduce some hysteresis in the queue size limit and reduce it
a little, so that bees isn't continually stopping and restarting crawls
every time one task is created or completed, and so that we stay under
the configured Task limit despite overcommitting.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
Zygo Blaxell
a2b3e1e0c2 log: demote a lot of BEESLOGWARN to higher verbosity levels
Toxic extent workarounds are going away because the underlying kernel
bugs have been fixed.  They are no longer worthy of spamming non-developer
logs.

INO_PATHS can return no paths if an inode has been deleted.  It doesn't
need a log message at all, much less one at WARN level.

Dedupe failure can be INFO, the same level as dedupe itself, especially
since the "NO dedupe" message doesn't mention what was [not] deduped.

Inspired by Kai Krakow's "context: demote "abandoned toxic match" to
debug log level".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 01:08:28 -05:00
Kai Krakow
aaec931081 context: demote "abandoned toxic match" to debug log level
This log message creates a overwhelmingly lot of messages in the system
journal, leading to write-back flushing storms under high activity. As
it is a work-around message, it is probably only useful to developers,
thus demote to debug level.

This fixes latency spikes in desktop usage after adding a lot of new
files, especially since systemd-journal starts to flush caches if it
sees memory pressure.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-19 00:59:22 -05:00
Zygo Blaxell
c53fa04a2f task: fixes for priority and idle Tasks
Tasks are not allowed to be queued more than once, but it is allowed
to queue a Task while it's already running, which means a Task can be
executed on two threads in parallel.  Tasks detect this and handle it
by queueing the Task on its own post-exec queue.  That in turn leads
to Workers which continually execute the same Task if that Task doesn't
create any new Tasks, while other Tasks sit on the Master queue waiting
for a Worker to dequeue them.

For idle Tasks, we don't want the Task to be rescheduled immediately.
We want the idle Task to execute again after every available Task on
both the main and idle queues has been executed.

Fix these by having each Task reschedule itself on the appropriate
queue when it finishes executing.

Priority queued Tasks should executed in priority order not just one
Task's post-exec queue, but the entire local queue of the TaskConsumer.

Fix this by moving the sort into either the TaskConsumer that receives
a post-exec queue, if there is one, or into the Task that is created
to insert the post-exec queue into a TaskConsumer when one becomes
available in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
v0.11-rc3
2025-01-15 00:43:25 -05:00
Zygo Blaxell
d4a681c8a2 Revert "roots: use a non-idle task for next_transid"
next_transid tasks don't respect queue selection very well, because
they effectively end up spinning in a loop until all other worker
threads become busy.

Back this out, and fix the priority handling in the Task library.

This reverts commit 58db4071de5f524c35b1362bfb5b1fceedea503f.
2025-01-12 18:48:33 -05:00
Zygo Blaxell
a819d623f7 task: do not allow queue loops in priority queueing mode
Tasks using non-priority FIFO dependency tracking can insert themselves
into their own queue, to run the Task again immediately after it exits.

For priority queues, this attempts to splice the post-exec queue into
itself, which doesn't seem like a good idea.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 15:28:26 -05:00
Zygo Blaxell
de9d72da80 task: flatten queues of dependent Tasks
Suppose Task A, B, and C are created in that order, and currently running.
Task T acquires Exclusion E.  Task B, A, and C attempt to acquire the
same Exclusion, in that order, but fail because Task T holds it.

The result is Task T with a post-exec queue:

        T, [ B, A, C ]  sort_requested

Now suppose Task U acquires Exclusion F, then Task T attempts to acquire
Exclusion F.  Task T fails to acquire F, so T is inserted into U's
post-exec queue.  The result at the end of the execution of T is a tree:

        U, [ T ]  sort_requested
             \-> [ B, A, C ] sort_requested

Task T exits after failing to acquire a lock.  When T exits, T will
sort its post-exec queue and submit the post-exec queue for execution
immediately:

        Worker 1: U, [ T ]  sort_requested
        Worker 2: A, B, C

This isn't ideal because T, A, B, and C all depend on at least one
common Exclusion, so they are likely to immediately conflict with T
when U exits and T runs again.

Ideally, A, B, and C would at least remain in a common queue with T,
and ideally that queue is sorted.

Instead of inserting T into U's post-exec queue, insert T and all
of T's post-exec queue, which creates a single flattened Task list:

        U, [ T, B, A, C ]   sort_requested

Then when U exits, it will sort [ T, B, A, C ] into [ A, B, C, T ],
and run all of the queued Tasks in age priority order:

        U exited, [ T, B, A, C ]   sort_requested

        U exited, [ A, B, C, T ]

        [ A, B, C, T ] on TaskConsumer queue

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 14:05:44 -05:00
Zygo Blaxell
74d8bdd60f task: add an insert method for priority-queueing Tasks by age
Task started out as a self-organizing parallel-make algorithm, but ended
up becoming a half-broken wait-die algorithm.  When a contended object
is already locked, Tasks enter a FIFO queue to restart and acquire the
lock.  This is the "die" part of wait-die (all locks on an Exclusion are
non-blocking, so no Task ever does "wait").  The lock queue is FIFO wrt
_lock acquisition order_, not _Task age_ as required by the wait-die
algorithm.

Make it a 25%-broken wait-die algorithm by sorting the Tasks on lock
queues in order of Task ID, i.e. oldest-first, or FIFO wrt Task age.
This ensures the oldest Task waiting for an object is the one to get
it when it becomes available, as expected from the wait-die algorithm.

This should reduce the amount of time Tasks spend on the execution queue,
and reduce memory usage by avoiding the accumulation of Tasks that cannot
make forward progress.

Note that turning `TaskQueue` into an ordered container would have
undesirable side-effects:

 * `std::list` has some useful properties wrt stability of object
 location and cost of splicing.  Other containers may not have these,
 and `std::list` does have a `sort` method.

 * Some Task objects are created at the beginning and reused continually,
 but we really do want those Tasks to be executed in FIFO order wrt
 submission, not Task ID.  We can exclude these tasks by only doing the
 sorting when a Task is queued for an Exclusin object.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 00:35:37 -05:00
Zygo Blaxell
a5d078d48b docs: deprecate the --workaround-btrfs-send option
Emphasize that the option is relevant to old kernels, older than the
minimum supportable version threshold.

De-emphasize the use case of "send-workaround" as a synonym for "exclude
read-only".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
e2587cae9b docs: expand "Threads and load management" to suggest not running bees so much
One of the more obvious ways to reduce bees load is to simply not run
it all the time.  Explicitly state using maintenance windows as a load
management option.

SIGUSR1 and SIGUSR2 should have been documented somewhere else before now.
Better late than never.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
ac581273d3 docs: config.md updates
The theories behind bees slowing down when presented with a larger has
table turned out to be wrong.  The real cause was a very old bug which
submitted thousands of `LOGICAL_INO` requests when only a handful of
requests were needed.

"Compression on the filesystem" -> "Compression in files"

Don't be so "dramatic".  Be "rapid" instead.

Remove "cannot avoid modifying read-only snapshots" as a distinction
between subvol and extent scans.  Both modes support send workaround
and send waiting with no significant distinction.

Emphasize extent scan's better handling of many snapshots.  Also reflinks.

Add some discussion of `--throttle-factor`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
7fcde97b70 docs: update the bug reporting and status instructions
Thread names have changed.  Document some of the newer ones.

Don't jump immediately to blaming poor performance on qgroups or
autodefrag.  These do sometimes have kernel regressions but not all
the time.

Emphasize advantage of controlling bees deferred work requests at the
source, before btrfs gets stuck committing them.

Avoid asserting that it's OK for gdb to crash.

Remove mention of lower-layer block device issues wrt corruption.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
e457f502b7 docs: update kernel bugs page for January 2025
"Kernel" -> "Linux kernel".  If you can run bees on a kernel that isn't
Linux, congratulations!

Emphasize the age of the data corruption warnings.  Once 5.4 reaches
EOL we can remove those.

Simplify the discussion of old kernels and API levels.  There's a
new optional kernel API for `openat2` support at 5.6.  The absolute
minimum kernel version is still 4.2, and will not increase to 4.15
until the subvol scanners are removed.

Remove discussion of bees support for kernels 4.19 (which recently
reached EOL) and earlier.

The `LOGICAL_INO` vs dedupe bug is actually a `LOGICAL_INO` vs clone bug.
Dedupe isn't necessary to reproduce it.

Remove a stray ')'.

Strip out most of the discussion of slow backrefs, as they are no longer a
concern on the range of supported kernel versions.  Leave some description
there because bees still has some vestigial workarounds.

Remove `btrfs send` from the "Unfixed kernel bugs" section, which makes
the section empty, so remove the section too.  bees now handles send on
a subvol reasonably well.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
46815f1a9d docs: update README.md
Emphasize "large" is an upper bound on the size of filesystem bees
can handle.

New strengths:  largest extent first for fixed maintenance windows,
scans data only once (ish), recovers more space

Removed weaknesses:  less temporary space

Need more caps than `CAP_SYS_ADMIN`.

Emphasize DATA CORRUPTION WARNING is an old-kernel thing.

Update copyright year.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
0d251d30f4 docs: update feature interaction lists
Tested on larger filesystems than 100T too, but let's use Fermi
approximation.  Next size is 1P.

Removed interaction with block-level SSD caching subsystems.  These are
really btrfs metadata vs. a lower block layer, and have nothing to do
with bees.

Added mixed block groups to the tested list, as mixed block groups
required explicit support in the extent scanner.

Added btrfs-convert to the tested list.  btrfs-convert has various
problems with space allocation in general, but these can be solved by
carefully ordered balances after conversion, and they have nothing to
do with bees.

In-kernel dedupe is dead and the stubs were removed years ago.  Remove it
from the list.

btrfs send now plays nicely with bees on all supportable kernels, now
that stable/linux-4.19.y is dead.  Send workaround is only needed for
kernels before v5.4 (technically v5.2, but nobody should ever mount a
btrfs with kernel v5.1 to v5.3).  bees will pause automatically when
deduping a subvol that is currently running a send.

bees will no longer gratuitously refragment data that was defragmented
by autodefrag.

Explicitly list all the RAID profiles tested so far, as there have been
some new ones.

Explicitly list other deduplicators tested.

Sort the list of btrfs features alphabetically.

Add scrub and balance, which have been tested with bees since the
beginning.

New tested btrfs features:  block-group-tree, raid1c3, raid1c4.

New untested btrfs features:  squotas, raid-stripe-tree.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
b8dd9a2db0 progress: put a timestamp in the bottom row
This records the time when the progress data was calculated, to help
indicate when the data might be very old.

While we're here, move "now" out of the loop so there's only one value.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00