Tasks using non-priority FIFO dependency tracking can insert themselves
into their own queue, to run the Task again immediately after it exits.
For priority queues, this attempts to splice the post-exec queue into
itself, which doesn't seem like a good idea.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Suppose Tasks A, B, and C are created in that order, and are currently
running. Task T acquires Exclusion E. Tasks B, A, and C attempt to
acquire the same Exclusion, in that order, but fail because Task T
holds it.
The result is Task T with a post-exec queue:
T, [ B, A, C ] sort_requested
Now suppose Task U acquires Exclusion F, then Task T attempts to acquire
Exclusion F. Task T fails to acquire F, so T is inserted into U's
post-exec queue. The result at the end of the execution of T is a tree:
U, [ T ] sort_requested
\-> [ B, A, C ] sort_requested
Task T exits after failing to acquire a lock. When T exits, T will
sort its post-exec queue and submit the post-exec queue for execution
immediately:
Worker 1: U, [ T ] sort_requested
Worker 2: A, B, C
This isn't ideal because T, A, B, and C all depend on at least one
common Exclusion, so they are likely to immediately conflict with T
when U exits and T runs again.
Ideally, A, B, and C would at least remain in a common queue with T,
and ideally that queue is sorted.
Instead of inserting T into U's post-exec queue, insert T and all
of T's post-exec queue, which creates a single flattened Task list:
U, [ T, B, A, C ] sort_requested
Then when U exits, it will sort [ T, B, A, C ] into [ A, B, C, T ],
and run all of the queued Tasks in age priority order:
U exited, [ T, B, A, C ] sort_requested
U exited, [ A, B, C, T ]
[ A, B, C, T ] on TaskConsumer queue
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task started out as a self-organizing parallel-make algorithm, but ended
up becoming a half-broken wait-die algorithm. When a contended object
is already locked, Tasks enter a FIFO queue to restart and acquire the
lock. This is the "die" part of wait-die (all locks on an Exclusion are
non-blocking, so no Task ever does "wait"). The lock queue is FIFO wrt
_lock acquisition order_, not _Task age_ as required by the wait-die
algorithm.
Make it a 25%-broken wait-die algorithm by sorting the Tasks on lock
queues in order of Task ID, i.e. oldest-first, or FIFO wrt Task age.
This ensures the oldest Task waiting for an object is the one to get
it when it becomes available, as expected from the wait-die algorithm.
This should reduce the amount of time Tasks spend on the execution queue,
and reduce memory usage by avoiding the accumulation of Tasks that cannot
make forward progress.
Note that turning `TaskQueue` into an ordered container would have
undesirable side-effects:
* `std::list` has some useful properties wrt stability of object
location and cost of splicing. Other containers may not have these,
and `std::list` does have a `sort` method.
* Some Task objects are created at the beginning and reused continually,
but we really do want those Tasks to be executed in FIFO order wrt
submission, not Task ID. We can exclude these Tasks by only doing the
sorting when a Task is queued for an Exclusion object.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Emphasize that the option is relevant to old kernels, older than the
minimum supportable version threshold.
De-emphasize the use case of "send-workaround" as a synonym for "exclude
read-only".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
One of the more obvious ways to reduce bees load is to simply not run
it all the time. Explicitly state using maintenance windows as a load
management option.
SIGUSR1 and SIGUSR2 should have been documented somewhere else before now.
Better late than never.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The theories behind bees slowing down when presented with a larger hash
table turned out to be wrong. The real cause was a very old bug which
submitted thousands of `LOGICAL_INO` requests when only a handful of
requests were needed.
"Compression on the filesystem" -> "Compression in files"
Don't be so "dramatic". Be "rapid" instead.
Remove "cannot avoid modifying read-only snapshots" as a distinction
between subvol and extent scans. Both modes support send workaround
and send waiting with no significant distinction.
Emphasize extent scan's better handling of many snapshots. Also reflinks.
Add some discussion of `--throttle-factor`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Thread names have changed. Document some of the newer ones.
Don't jump immediately to blaming poor performance on qgroups or
autodefrag. These do sometimes have kernel regressions but not all
the time.
Emphasize advantage of controlling bees deferred work requests at the
source, before btrfs gets stuck committing them.
Avoid asserting that it's OK for gdb to crash.
Remove mention of lower-layer block device issues wrt corruption.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"Kernel" -> "Linux kernel". If you can run bees on a kernel that isn't
Linux, congratulations!
Emphasize the age of the data corruption warnings. Once 5.4 reaches
EOL we can remove those.
Simplify the discussion of old kernels and API levels. There's a
new optional kernel API for `openat2` support at 5.6. The absolute
minimum kernel version is still 4.2, and will not increase to 4.15
until the subvol scanners are removed.
Remove discussion of bees support for kernels 4.19 (which recently
reached EOL) and earlier.
The `LOGICAL_INO` vs dedupe bug is actually a `LOGICAL_INO` vs clone bug.
Dedupe isn't necessary to reproduce it.
Remove a stray ')'.
Strip out most of the discussion of slow backrefs, as they are no longer a
concern on the range of supported kernel versions. Leave some description
there because bees still has some vestigial workarounds.
Remove `btrfs send` from the "Unfixed kernel bugs" section, which makes
the section empty, so remove the section too. bees now handles send on
a subvol reasonably well.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Emphasize "large" is an upper bound on the size of filesystem bees
can handle.
New strengths: largest extent first for fixed maintenance windows,
scans data only once (ish), recovers more space
Removed weaknesses: less temporary space
Need more caps than `CAP_SYS_ADMIN`.
Emphasize DATA CORRUPTION WARNING is an old-kernel thing.
Update copyright year.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tested on larger filesystems than 100T too, but let's use Fermi
approximation. Next size is 1P.
Removed interaction with block-level SSD caching subsystems. These are
really btrfs metadata vs. a lower block layer, and have nothing to do
with bees.
Added mixed block groups to the tested list, as mixed block groups
required explicit support in the extent scanner.
Added btrfs-convert to the tested list. btrfs-convert has various
problems with space allocation in general, but these can be solved by
carefully ordered balances after conversion, and they have nothing to
do with bees.
In-kernel dedupe is dead and the stubs were removed years ago. Remove it
from the list.
btrfs send now plays nicely with bees on all supportable kernels, now
that stable/linux-4.19.y is dead. Send workaround is only needed for
kernels before v5.4 (technically v5.2, but nobody should ever mount a
btrfs with kernel v5.1 to v5.3). bees will pause automatically when
deduping a subvol that is currently running a send.
bees will no longer gratuitously refragment data that was defragmented
by autodefrag.
Explicitly list all the RAID profiles tested so far, as there have been
some new ones.
Explicitly list other deduplicators tested.
Sort the list of btrfs features alphabetically.
Add scrub and balance, which have been tested with bees since the
beginning.
New tested btrfs features: block-group-tree, raid1c3, raid1c4.
New untested btrfs features: squotas, raid-stripe-tree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This records the time when the progress data was calculated, to help
indicate when the data might be very old.
While we're here, move "now" out of the loop so there's only one value.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This increases resistance to symlink and mount attacks.
Previously, bees could follow a symlink or a mount point in a directory
component of a subvol or file name. Once the file is opened, the open
file descriptor would be checked to see if its subvol and inode matches
the expected file in the target filesystem. Files that fail to match
would be immediately closed.
With openat2 resolve flags, symlinks and mount points terminate path
resolution in the kernel. Paths that lead through symlinks or onto
mount points cannot be opened at all.
Fall back to openat() if openat2() returns ENOSYS, so bees will still
run on kernels before v5.6.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we're now using weak symbols for dodgy libc functions, we might
as well do it for gettid() too.
Call gettid() through the global namespace (::gettid()) and let libc
override it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
openat2 allows closing more TOCTOU holes, but we can only use it when
the kernel supports it.
This should disappear seamlessly when libc implements the function.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"ctime", an abbreviation of "cycle time", collides with "ctime", an
abbreviation of "st_ctime", a well-known filesystem term.
"tm_left" fits in the column, so use that.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* Report position within cycle in units that cannot be mistaken for size or percentage
* Put the total/maximum values in their own row
* Add a start time column
* Change column titles to reference "cycles"
* Use "idle" instead of "finished" when a crawler is not running
* Replace "transid" with "gen" because it's shorter
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The scanners which finish early can become stuck behind scanners that are
able to keep the queue full. Switch the next_transid task to the normal
Task queues so that we force scanners to restart on every new transaction,
possibly deferring already queued work to do so.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add yet another field to the scan/skip report line: the wallclock
time used to process the extent ref.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The total data size should not include metadata or system block groups,
and already does not; however, we still have these block groups in the map
for mapping the crawl pointer to a logical offset within the filesystem.
Rearrange a few lines around the `if` statement so that the map doesn't
contain anything it should not.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The progress indicator was failing on a mixed-bg filesystem because those
filesystems have block groups which have both _DATA and _METADATA bits,
and the filesystem size calculation was excluding block groups that have
_METADATA set. It should exclude block groups that have _DATA not set.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Running bees with no arguments complains about "Only one" path argument.
Replace this with "Exactly one", which uses terminology similar to other
btrfs tools.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
`getopt_long` already supplies a message when an option cannot be parsed,
so there isn't a need to distinguish option parse failures from help
requests.
Fixes: https://github.com/Zygo/bees/pull/277
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Longer latency testing runs are not showing a consistent gain from a
throttle factor of 1.0. Make the default more conservative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Decaying averages by 10% every 5 minutes gives roughly a half-hour
half-life to the rolling average. Speed that up to once per minute.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We're not adding any more short options, but the debugging code doesn't
work with optvals above 255. Also clean up constness and variable
lifetimes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Measure the time spent running various operations that extend btrfs
transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe)
and arrange for each operation to run for not less than the average
amount of time by adding a sleep after each operation that takes less
than the average.
The delay after each operation is intended to slow down the rate of
deferred and long-running requests from bees to match the rate at which
btrfs is actually completing them. This may help avoid big spikes in
latency if btrfs has so many requests queued that it has to force a
commit to release memory.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Test machines keep blowing past the 32k file limit. 16 worker
threads at 10,000 files each is much larger than 32k.
Other high-FD-count services like DNS servers ask for million-file
rlimits.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
While a snapshot is being deleted, there will be a continuous stream of
"No ref for extent" messages. This is a common event that does not need
to be reported.
There is an analogous situation when a call to open() fails with ENOENT.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Dedupe is not possible on a subvol where a btrfs send is running:
BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress)
btrfs informs a process with EAGAIN that a dedupe could not be performed
due to a running send operation.
It would be possible to save the crawler state at the affected point,
fork a new crawler that avoids the subvol under send, and resume the
crawler state after a successful dedupe is detected; however, this would
only help users in the intersection of two sets: those with unrelated
subvols that don't share extents, and those who cannot simply delay
dedupe until the send is finished. The simplest approach is taken here:
stop and wait until the send goes away. When a dedupe fails with EAGAIN,
affected Tasks will poll, approximately once per transaction, until the
dedupe succeeds or fails with a different error.
bees dedupe performance corresponds with the availability of subvols that
can accept dedupe requests. While the dedupe is paused, no new Tasks can
be performed by the worker thread. If subvols are small and isolated
from the bulk of the filesystem data, the result will be a small but
partial loss of dedupe performance during the send as some worker threads
get stuck on the sending subvol. If subvols heavily share extents with
duplicate data in other subvols, worker threads will all become blocked,
and the entire bees process will pause until at least some of the running
sends terminate.
During the polling for btrfs send, the dedupe Task will hold its dst
file open. This open FD won't interfere with snapshot or file delete
because send subvols are always read-only (it is not possible to delete
a file on a RO subvol, open or otherwise) and send itself holds the
affected subvol open, preventing its deletion. Once the send terminates,
the dedupe will terminate soon after, and the normal FD release can occur.
This pausing during btrfs send is unrelated to the
`--workaround-btrfs-send` option, although `--workaround-btrfs-send` will
cause the pausing to trigger less often. It applies to all scan modes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are no callers of this method any more, and it exposes more
of BeesRoots than we really want things to have access to.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
All callers of the `transid_max_nocache` method update `m_transid_re`
with the return value, so do that in `transid_max_nocache` itself.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* Allow RateLimiter to change rate after construction.
* Check range of rate argument in constructor.
* Atomic increment for RateEstimator.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The "done" pointer and the "%done" fields are still useful because they
indicate _actual_ progress, not the work that has been _promised_.
So it is possible for a crawl to be "finished" (all extents queued)
but not "100.0000%" (some of those extents still active or in the queue).
"deferred" state isn't particularly useful, so drop it.
"finished" state implies no ETA, so that column is unused.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
ETA is calculated using a sample obtained by snooping on bees's normal
crawling operations.
This sample is heavily biased and not representative of the entire
filesystem. If the distribution of extent sizes in the filesystem is
not uniform, the ETA can be wildly wrong.
Collecting an accurate sample set would require extra IO and CPU time
which should be spent doing dedupes instead.
Explicitly label the ETA as inaccurate to avoid having too many users
report the same bug.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees might be unpaused at any time, so make sure that the dynamic load
calculation is ready with a non-zero thread count.
This avoids a delay of up to 5 seconds when responding to SIGUSR2
when loadavg tracking is enabled.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These are simple on/off switches for the task queue. They are lightweight
requests for bees to be paused temporarily, but allow bees to release
open files and save progress while paused.
These signals are an alternative to SIGSTOP and SIGCONT, or using the
cgroup freezer's FROZEN and THAWED states, which pause and resume the
bees process, but do not allow the bees process to release open files
or save progress. Snapshot and file deletes can occur on the filesystem
while bees is paused by SIGUSR1 but not by SIGSTOP.
These signals are also an alternative to SIGTERM and restart, which
flush out the whole hash table and progress state on exit, and read
the whole table back into memory on restart.
This feature is experimental and may be replaced by a more general
configuration or runtime control mechanism in the future.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When paused, TaskConsumer threads will eventually notice the paused
condition and exit; however, there's nothing to restart threads when
exiting the paused state.
When unpausing, and while the lock is already held, create TaskConsumer
threads as needed to reach the target thread count.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit 72c3bf8438830b65cae7bdaff126053e562280e5 ("fs: handle ENOENT
within lib") was meant to prevent exceptions when a subvol is deleted.
If the search ioctl fails, the kernel won't set nr_items in the
ioctl output, which means `nr_items` still has the input value. When
ENOENT is detected, `this->nr_items` is set to 0, then later `*this =
ioctl_ptr->key` overwrites `this->nr_items` with the original requested
number of items.
This replaced the ENOENT exception with an exception triggered by
interpreting garbage in the memory buffer. The number of exceptions
was reduced because the memory buffers are frequently reused, but upper
layers would then reject the data or ignore it because it didn't match
the key range.
Fix by setting `ioctl_ptr->key.nr_items`, which then overwrites
`this->nr_items`, so the loop that extracts items from the ioctl data
gets the right number of items (i.e. zero).
Fixes: 72c3bf8438830b65cae7bdaff126053e562280e5 ("fs: handle ENOENT within lib")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In some cases the offset and size arguments were flipped when checking to
see if a range had already been read. This would have been OK as long as
the same mistake had been made consistently: `bees_readahead_check`
only does a cache lookup on the parameters; it doesn't try to use them
to read a file. Alas, there was one case where the correct order was
used, albeit a relatively rare one.
Fix all the calls to use the correct order.
Also fix a comment: the recent request cache is global to all threads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
hexdump was moved into a template in its own header years ago, but
the declaration of the implementation that used to be in fs.cc remains.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
hexdump processes a vector as a contiguous sequence of bytes, regardless
of V's value type, so hexdump should get a pointer and use uint8_t to
read the data.
Some vector types have a lock and some atomics in their operator[], so
let's avoid hammering those.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
operator<< was a friend function that locked the ByteVector, then
invoked hexdump on the ByteVector, which used
ByteVector::operator[]...which locked the ByteVector again, resulting
in a deadlock.
operator<< shouldn't be a friend anyway. Make hexdump use the normal
public access methods of ByteVector.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Although all the members of BtrfsExtentDataFetcher are theoretically
copyable, there's no need to actually make any such copy.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>