Exceptions were logged at level NOTICE while the stack traces were logged
at level DEBUG. That produced useless noise in the output with `-v5`
or `-v6`, where there were exception headings logged, but no details.
Fix that by placing the exceptions and traces at level DEBUG, but prefix
them with `TRACE:` for easy grepping.
Most of the events associated with BEESLOGTRACE either never happen,
or they are harmless (e.g. trying to open deleted files or subvols).
Reassign them to ordinary BEESLOGDEBUG, with one exception for
unrecognized Extent flags that should be debugged if any appear.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Older kernel versions featured some bugs in btrfs `fsync`, which could
leave behind "ghost dirents", orphan filename items that did not have
a corresponding inode. These dirents were created by log replay on
the first mount after a crash, due to several different bugs in
the log tree and its use over the years. The last known bug of this
kind was fixed in kernel 5.16. As of this writing, no fixes for this
bug have been backported to any earlier LTS kernel.
Some filesystems, including btrfs, will flush the contents of a new
file before renaming it over an old file. On paper, btrfs can do this
very cheaply since the contents of the new file are not referenced, and
the old file not dereferenced, until a tree commit which includes both
actions atomically; however, in real life, btrfs provides `fsync`-like
semantics and uses the log-tree infrastructure to implement them, which
compromises performance and acts as a magnet for bugs.
The benefit of this trade-off is that `rename` can be used as a
synchronization point for data outside of the btrfs filesystem, which
would not happen if everything `rename` does were simply deferred to the next
tree commit. The cost of this trade-off is that for the first 8 years
of its existence, bees would trigger the bug so often that the project
recommended its users put $BEESHOME in its own subvol to make it easy
to remove ghost dirents left behind by the bug.
Some other filesystems, such as xfs, don't have any special semantics for
`rename`, and require `fsync` to avoid garbage or missing data after
a crash. Even filesystems which do have a special case for `rename`
can be configured to turn it off.
btrfs will silently delete data from files in the event that an
unrecoverable data block write error occurs. Kernel version 6.2 adds
important new and unexpected cases where this can happen on filesystems
using raid56 data, but it also happens in all usable btrfs versions
(the silent deletion behavior was introduced in kernel version 3.9).
Unrecoverable write errors are currently reported to userspace only
through `fsync`. Since the failed extents are deleted, they cannot be
detected via csum failures or scrub after the fact--and it's too late
by then: the data is already gone. `fsync` is the last opportunity
to detect the write failure before the `rename`. If the error is not
detected, the contents of the file will be silently discarded in btrfs.
The impact on bees is that scans will abruptly restart from zero after
a crash combined with some other reasonably common failures.
Putting all of this together leads to a rather complex workaround:
if the filesystem under $BEESHOME (specifically, the filesystem where
BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs
filesystem, and the host kernel is a version prior to 5.16, then don't
call `fsync` before `rename`. In all other cases, do call `fsync`,
and prevent dependent writes (i.e. the following `rename`) in the event
of errors.
Since present kernel versions still require `fsync`, we don't need
an upper bound on the kernel version check until someone fixes btrfs
`rename` (or perhaps adds a flag to `renameat2` which prevents use of
the log tree) in the kernel. Once that fix happens, we can drop the
`fsync` call for kernels after that fixed version.
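The decision rule above can be sketched roughly as follows. This is an
illustration only, not the code in this commit: the helper names are
hypothetical, the release string is assumed to come from something like
`uname(2)`, and the `is_btrfs` flag from something like `fstatfs(2)`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: parse "major.minor" from a uname-style release
// string such as "5.15.0-91-generic".  Returns false if parsing fails.
bool parse_kernel_version(const std::string &release, int &major, int &minor)
{
    return std::sscanf(release.c_str(), "%d.%d", &major, &minor) == 2;
}

// Decide whether to call fsync() before rename(): skip fsync only when
// the target filesystem is btrfs and the kernel is older than 5.16.
bool want_fsync_before_rename(bool is_btrfs, const std::string &release)
{
    int major = 0, minor = 0;
    if (!parse_kernel_version(release, major, minor)) {
        // Assumption of this sketch: an unparseable version is treated
        // like a current kernel, so fsync is still called.
        return true;
    }
    const bool old_kernel = major < 5 || (major == 5 && minor < 16);
    return !(is_btrfs && old_kernel);
}
```

There is deliberately no upper bound on the version in this sketch, for
the reason given above.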
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This obviously doesn't fix or prevent the kernel bug, but it does prevent
bees from triggering the bug without assistance from another application.
The bug can still be triggered by running bees at the same time as an
application which uses clone or LOGICAL_INO. `btdu` uses LOGICAL_INO,
while `cp` from coreutils (and many others) uses clone (reflink copy).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The cwd is where core dumps and various profiling and verification
libraries want to write their data, whereas root_fd is the root of the
target filesystem. These are often intentionally different. When
they are different, `--strip-paths` sets the wrong prefix to strip
from paths.
Once the root fd has been established, we can set the path prefix to
the string prefix that we'll get from future calls to `name_fd`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Remove dubious comments and #if 0 section. Document new event counters,
and add one for read failures.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extent workarounds are going away because the underlying kernel
bugs have been fixed. They are no longer worthy of spamming non-developer
logs.
INO_PATHS can return no paths if an inode has been deleted. It doesn't
need a log message at all, much less one at WARN level.
Dedupe failure can be INFO, the same level as dedupe itself, especially
since the "NO dedupe" message doesn't mention what was [not] deduped.
Inspired by Kai Krakow's "context: demote "abandoned toxic match" to
debug log level".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Running bees with no arguments complains about "Only one" path argument.
Replace this with "Exactly one" which uses similar terminology to other
btrfs tools.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
`getopt_long` already supplies a message when an option cannot be parsed,
so there isn't a need to distinguish option parse failures from help
requests.
Fixes: https://github.com/Zygo/bees/pull/277
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Longer latency testing runs are not showing a consistent gain from a
throttle factor of 1.0. Make the default more conservative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Decaying averages by 10% every 5 minutes gives roughly a half-hour
half-life to the rolling average. Speed that up to once per minute.
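The half-life arithmetic can be checked with a few lines of code. This is
a sketch: the function name is made up for illustration, and it assumes
the 10% decay factor stays the same when the interval changes.

```cpp
#include <cmath>

// Half-life, in minutes, of an average that keeps the fraction `keep`
// of its value every `interval_minutes` minutes.
double half_life_minutes(double keep, double interval_minutes)
{
    // keep^n == 0.5  =>  n == log(0.5) / log(keep) decay steps
    return interval_minutes * std::log(0.5) / std::log(keep);
}
```

`half_life_minutes(0.9, 5)` is about 32.9 minutes, and
`half_life_minutes(0.9, 1)` is about 6.6 minutes.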
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We're not adding any more short options, but the debugging code doesn't
work with optvals above 255. Also clean up constness and variable
lifetimes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Measure the time spent running various operations that extend btrfs
transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe)
and arrange for each operation to run for not less than the average
amount of time by adding a sleep after each operation that takes less
than the average.
The delay after each operation is intended to slow down the rate of
deferred and long-running requests from bees to match the rate at which
btrfs is actually completing them. This may help avoid big spikes in
latency if btrfs has so many requests queued that it has to force a
commit to release memory.
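A minimal sketch of the delay calculation, assuming the average is
tracked elsewhere. The names here are hypothetical, and the `factor`
parameter is an assumption standing in for a tunable throttle factor.

```cpp
#include <algorithm>
#include <chrono>

using Seconds = std::chrono::duration<double>;

// How long to sleep after an operation that took `elapsed`, so that the
// operation plus the sleep lasts at least `average * factor`.
Seconds throttle_delay(Seconds elapsed, Seconds average, double factor)
{
    const Seconds target = average * factor;
    // Operations that already took longer than the target get no delay.
    return std::max(Seconds::zero(), target - elapsed);
}
```

A caller would pass the result to `std::this_thread::sleep_for` after
each throttled operation.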
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These are simple on/off switches for the task queue. They are lightweight
requests for bees to be paused temporarily, but allow bees to release
open files and save progress while paused.
These signals are an alternative to SIGSTOP and SIGCONT, or using the
cgroup freezer's FROZEN and THAWED states, which pause and resume the
bees process, but do not allow the bees process to release open files
or save progress. Snapshot and file deletes can occur on the filesystem
while bees is paused by SIGUSR1 but not by SIGSTOP.
These signals are also an alternative to SIGTERM and restart, which
flush out the whole hash table and progress state on exit, and read
the whole table back into memory on restart.
This feature is experimental and may be replaced by a more general
configuration or runtime control mechanism in the future.
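The on/off switch idea can be sketched as a flag set from signal handlers
and polled by the task queue. This is not the code in this commit: the
names are hypothetical, and using SIGUSR2 as the resume signal is an
assumption of the sketch.

```cpp
#include <csignal>

// Set by the handlers below; polled by a (hypothetical) task queue loop.
volatile std::sig_atomic_t g_paused = 0;

extern "C" void handle_pause(int)  { g_paused = 1; }
extern "C" void handle_resume(int) { g_paused = 0; }

// SIGUSR1 pauses, as described above.  SIGUSR2 resuming is an assumption.
void install_pause_signals()
{
    std::signal(SIGUSR1, handle_pause);
    std::signal(SIGUSR2, handle_resume);
}

// The task queue would stop starting new tasks while this returns true,
// leaving the process free to close files and save progress.
bool is_paused()
{
    return g_paused != 0;
}
```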
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In some cases the offset and size arguments were flipped when checking to
see if a range had already been read. This would have been OK as long as
the same mistake had been made consistently, since `bees_readahead_check`
only does a cache lookup on the parameters and doesn't try to use them to
read a file. Alas, there was one case where the correct order was used,
albeit a relatively rare one.
Fix all the calls to use the correct order.
Also fix a comment: the recent request cache is global to all threads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The serialization doesn't seem to be necessary for the extent scan mode.
No infinite loops in the kernel have been observed in the past two years,
despite never having used MultiLock for the extent scanner.
Leave the serialization for now on the subvol scanners.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves a third bad problem with bees reads:
3. The architecture above the read operations will issue read requests
for the same physical blocks over and over in a short period of time.
Fixing that properly requires rewriting the upper-level code, but a
simple small table of recent read requests can reduce the effect of the
problem by orders of magnitude.
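A small table of recent requests could look something like this. It is a
sketch under assumptions, not the implementation in this commit: the key
layout, class name, and eviction policy (oldest first) are all
hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <tuple>

// Hypothetical key for one read request: file id, offset, size.
using ReadKey = std::tuple<std::uint64_t, std::uint64_t, std::uint64_t>;

class RecentReads {
    std::mutex          m_mutex;
    std::deque<ReadKey> m_recent;
    const std::size_t   m_capacity;
public:
    explicit RecentReads(std::size_t capacity) : m_capacity(capacity) {}

    // Returns true if the same request was recorded recently, so the
    // caller can skip the read.  Otherwise records it and returns false.
    bool check_and_record(const ReadKey &key)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (std::find(m_recent.begin(), m_recent.end(), key) != m_recent.end()) {
            return true;
        }
        m_recent.push_back(key);
        if (m_recent.size() > m_capacity) {
            m_recent.pop_front();   // forget the oldest request
        }
        return false;
    }
};
```

A linear search is enough here because the table is meant to stay small.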
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves some of the worst problems with bees reads:
1. The kernel readahead doesn't work. More precisely, it's much better
adapted for a very different use case: a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead. The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads either
get preempted by the random reads, or get cancelled because the data
access pattern isn't sequential after the readahead was issued.
2. Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces. At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile. In all other cases,
the elaborate calculations always return the same result: there can be
only one reader at a time.
This commit fixes both problems:
1. Don't use the kernel readahead. Use normal reads into a dummy
buffer instead.
2. Allow only one thread to readahead at any time. Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.
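The two fixes can be sketched as below. This is an illustration under
assumptions, not the code in this commit: the function names and the
128 KiB scratch buffer size are made up for the example.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <mutex>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

// One lock shared by all threads: only one "readahead" runs at a time.
static std::mutex s_readahead_mutex;

// Read [offset, offset + size) into a scratch buffer and discard the data.
// Caller must hold s_readahead_mutex.  Returns the number of bytes read.
static std::size_t read_and_discard(int fd, off_t offset, std::size_t size)
{
    std::vector<char> scratch(std::min<std::size_t>(size, 128 * 1024));
    std::size_t total = 0;
    while (total < size) {
        const std::size_t want = std::min(scratch.size(), size - total);
        const ssize_t got = pread(fd, scratch.data(), want,
                                  offset + static_cast<off_t>(total));
        if (got < 0 && errno == EINTR) continue;   // interrupted, retry
        if (got <= 0) break;                       // EOF or error: stop early
        total += static_cast<std::size_t>(got);
    }
    return total;
}

// Pull one range into the page cache using ordinary reads.
std::size_t readahead_by_reading(int fd, off_t offset, std::size_t size)
{
    std::lock_guard<std::mutex> lock(s_readahead_mutex);
    return read_and_discard(fd, offset, size);
}

// Pull two ranges into the page cache under a single lock acquisition.
std::size_t readahead_pair_by_reading(int fd1, off_t off1, std::size_t size1,
                                      int fd2, off_t off2, std::size_t size2)
{
    std::lock_guard<std::mutex> lock(s_readahead_mutex);
    const std::size_t first = read_and_discard(fd1, off1, size1);
    const std::size_t second = read_and_discard(fd2, off2, size2);
    return first + second;
}
```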
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Calling 'bees -m4' should not call 'std::terminate()', but it does.
Use catch_all instead. It will still pass the exit value to return
from main.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESTOOLONG was always reporting a size of zero, and the offset of the
end of the readahead region. Report the original size instead (and also
in BEESTRACE and BEESNOTE).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This was done on the development branch three years ago, and
has been creating annoying merge conflicts ever since. Sync
up the branches so they have the same names for these.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It seems that readahead() does not work on btrfs, or at least it has
no discernible effect. Enable the workaround instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees_sync() was an exception-trapping wrapper around fsync() which is
not needed in any of the contexts from which it was called:
1. dedupe operations implicitly flush the src data, so there is
no need to call fsync() to do that twice.
2. crawl position is written to a temporary file and renamed
over the original, which always forces a flush when the original
exists. On the first write, where there is no original, a
crash would result in starting over with an empty or hole-filled
beescrawl file, which is the initial state of bees. There is also
a long history of kernel bugs triggered by fsync() in this case.
3. we use unreadahead to trigger writeback for flushing the
hash table to persistent storage. Here is a space where we might
use fsync after all, as part of bees_unreadahead's emulation of
POSIX_FADV_DONTNEED, but we need to get read-once behavior from
the scanner before we can use this capability.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We need random numbers in more places, so centralize the engines.
Initialize with a proper random seed so every worker thread gets
different behavior.
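One common way to get a separately seeded engine per thread is sketched
below. The function names are hypothetical, and the choice of
`std::default_random_engine` is an assumption of the example.

```cpp
#include <random>

// Returns this thread's random engine, seeded once per thread from the
// system entropy source so that threads do not share a sequence.
std::default_random_engine &thread_random_engine()
{
    thread_local std::default_random_engine engine(std::random_device{}());
    return engine;
}

// Convenience wrapper: uniform integer in [lo, hi].
int random_between(int lo, int hi)
{
    return std::uniform_int_distribution<int>(lo, hi)(thread_random_engine());
}
```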
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table is one of the few cases in bees where a non-trivial amount
of page cache memory will be used in a predictable way, so we can advise
the kernel about our IO demands in advance.
Use WILLNEED to prefetch hash table pages at startup.
Use DONTNEED to trigger writeback on hash table pages at shutdown.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In theory, we don't need the pread() loop, because the kernel will do a
better job with readahead().
In practice, we might still need the pread() code, as the readahead will
occur at idle IO priority, which could adversely affect bees performance.
More testing is required.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This allows these components to be used by test executables without
pulling in all of bees, and more rapidly iterate their code.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Currently if crawl throws an exception, we don't have basic information
about what was being crawled or even if the crawler was running at all.
These traces also help identify the causes of early exception failures.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There seem to be multiple ways to do readahead in Linux, and only some
of them work. Hopefully reading the actual data is one of them.
This is an attempt to avoid page-by-page reads in the generic dedupe code.
We load both extents into the VFS cache (read sequentially) and hope they
are still there by the time we call dedupe on them.
We also call readahead(2) and hopefully that either helps or does nothing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Change documentation and comments to use the word "dedupe," not "dedup"
as found in circa-3.15 kernel sources.
No changes in code or program output--identifiers and messages that were
spelled "dedup" before are still spelled "dedup" now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Higher CPU core counts have become more common, and kernel bugs have become
less common, since the arbitrary 8-thread limit was introduced. We can remove
the limit now, and treat any remaining scaling inefficiency as a bug to
be removed.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apparently it is missing in newer Linux headers, making
builds fail. We don't need it, so remove it.
Closes: #160
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Now that tempfiles are using pool checkin functions to control their
size, we don't need a size limit in realign().
We keep the limit in make_copy because it's a sanity check against
letting a multi-terabyte copy operation slip through.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Get rid of the thread-local TempFiles and use Pool instead. This
eliminates a potential FD leak when the loadavg governor repeatedly
creates and destroys threads.
With the old per-thread TempFiles, we were guaranteed to have exclusive
ownership of the TempFile object within the current thread. Pool is
somewhat stricter: it only guarantees ownership while the checked-out
Handle exists. Adjust the users of TempFile objects to ensure they hold
the Handle object until they are finished using the TempFile.
It appears that maintaining large, heavily-reflinked, long-lived temporary
files costs more than truncating after every use: btrfs has to write
multiple references to the temporary file's extents, then some commits
later, remove references as the temporary file is deleted or truncated.
Using the temporary file in a dedupe operation flushes the data to disk,
so nothing is saved by pretending that there is writeback pipelining and
trying to avoid flushes in truncate. Pool provides usage tracking and
a checkin callback, so use it to truncate the temporary file immediately
after every use.
Redesign TempFile so that every instance creates exactly one Fd which
persists over the lifetime of the TempFile object. Provide a reset()
method which resets the file back to the initial state and call it from
the Pool checkin callback. This makes TempFile's lifetime equivalent to
its Fd's lifetime, which simplifies interactions with FdCache and Roots.
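The checkout/checkin pattern described above can be sketched with a
`shared_ptr` deleter. This is a simplified stand-in with hypothetical
names, not the crucible Pool API, and it assumes the pool outlives every
Handle it gives out.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

template <class T>
class SimplePool {
public:
    using Handle  = std::shared_ptr<T>;
    using Checkin = std::function<void(T &)>;

    explicit SimplePool(Checkin checkin) : m_checkin(std::move(checkin)) {}

    // Borrow an object.  It returns to the pool when the last copy of
    // the Handle is destroyed.
    Handle checkout()
    {
        std::unique_ptr<T> obj;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (!m_idle.empty()) {
                obj = std::move(m_idle.back());
                m_idle.pop_back();
            }
        }
        if (!obj) obj.reset(new T());   // pool was empty: make a new one
        return Handle(obj.release(), [this](T *p) { checkin(p); });
    }

private:
    void checkin(T *p)
    {
        std::unique_ptr<T> obj(p);
        m_checkin(*obj);                // e.g. reset a temporary file
        std::lock_guard<std::mutex> lock(m_mutex);
        m_idle.push_back(std::move(obj));
    }

    Checkin                         m_checkin;
    std::mutex                      m_mutex;
    std::vector<std::unique_ptr<T>> m_idle;
};
```

The object is only safe to use while its Handle exists, which is the
stricter ownership rule mentioned above.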
This change means we can now blacklist temporary files without having
an effective memory leak, so do that. We also now have a reason to
remove something from the blacklist, so add a method for that too.
In order to move to extent-centric addressing, we need to be able to
reliably open temporary files by root and inode number. Previously we
would place TempFile fd's into the cache with insert_root_ino, but the
cache would be cleared periodically, and it would not be possible to
reopen temporary files after that happened. Now that the TempFile's
lifetime is the same as the TempFile Fd's lifetime, we can have TempFile
manage a separate FileId -> Fd map in Roots which is unaffected by the
periodic cache clearing. BeesRoots::open_root_ino_nocache will check
this map before attempting to open the file via btrfs root+ino lookup,
and return it through the cache as if Roots had opened the file via btrfs.
Hold a reference to BeesRoots in BeesTempFile because the usual way
to get such a reference now throws an exception in BeesTempFile's
destructor.
These changes make method BeesTempFile::create() and all methods named
insert_root_ino unnecessary, so delete them.
We construct and destroy TempFiles much less often now, so make their
constructor and destructor more informative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
I was never able to prove a connection between fsync() and deadlock bugs.
There were too many deadlock bugs to be able to isolate a bug that is
triggered specifically by fsync.
Update the comment (which has been unchanged since kernel 4.14). We still
may want to do fsync() on temporary files someday, but there's a full
internal API rewrite between here and there.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It's a pain to read, edit, and format large blocks of text in C++ code,
so rip the usage message out of bees.cc and put it in a plain text file.
Use a minimal translator to convert it into a C string.
While we're here, remove the multiple roots feature from the command
line synopsis, as we don't really support it any more. Also clarify
that "id 5" is "subvol id 5", and describe in one sentence what
workaround-btrfs-send does.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
uncaught_exception() had only the one valid use case, and it can be
reimplemented by literally calling current_exception() instead.
current_exception() has several valid use cases, so it is not likely
to be deprecated any time soon.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We cannot use BeesContext::roots() until after
BeesContext::set_root_path() has been called.
Save up the parameter settings until then.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In version 2.30 glibc added its own gettid() function. This resulted in
"error: call of overloaded ‘gettid()’ is ambiguous" because gettid()
now exists in both namespace crucible and std.
For now, use explicit references to namespace crucible. This continues
to work with new and old libc without having to test specific library
versions.
At some point, glibc gettid() will be deployed widely enough that we can
remove the crucible version entirely.
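The shape of the fix is roughly as follows. This is a sketch only:
`crucible_sketch` is a stand-in namespace invented for the example, and
the wrapper body is an assumption about how such a function is typically
written.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

namespace crucible_sketch {
    // Stand-in for the library's own wrapper around the gettid syscall.
    inline pid_t gettid()
    {
        return static_cast<pid_t>(syscall(SYS_gettid));
    }
}

using namespace crucible_sketch;

pid_t current_thread_id()
{
    // An unqualified gettid() here can be ambiguous when libc also
    // declares one.  The qualified call works with old and new libc.
    return crucible_sketch::gettid();
}
```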
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some build environments (ARM? AARCH64?) do not have the fields
si_lower and si_upper in siginfo.
bees doesn't need them, so don't try to access them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce a mechanism to suppress exceptions which do not produce a
full stack trace for common known cases where a loop should be aborted.
Use this mechanism to suppress the infamous "FIXME" exception.
Reduce the log level to at most NOTICE, and in some cases DEBUG.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Capture SIGINT and SIGTERM and shut down, preserving current completed
crawl and hash table state.
* Executing tasks are completed, queued tasks are paused.
* Crawl state is saved.
* The crawl master and crawl writeback threads are terminated.
* The task queue is flushed.
* Dirty hash table extents are flushed.
* Hash prefetch and writeback threads are terminated.
* Hash table is deallocated.
* FD caches and tmpfiles are destroyed.
* Assuming the above didn't crash or deadlock, bees exits.
The above order isn't the fastest, but it does roughly follow the
shared_ptr dependencies and avoids data races--especially those that
might lead to bees reporting an extent scanned when it was only queued
for future scanning that did not occur.
In case of a violation of expected shared_ptr dependency order,
exceptions in BeesContext child object accessor methods (i.e. roots(),
hash_table(), etc) prevent any further progress in threads that somehow
remain unexpectedly active.
Move some threads from main into BeesContext so they can be stopped
via BeesContext. The main thread now runs a loop waiting for signals.
A slow FD leak was discovered in TempFile handling. This has not been
fixed yet, but an implementation detail of the C++ runtime library makes
the leak so slow it may never be important enough to fix.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The deadlock seems to be fixed now (if there ever was one--there certainly
were deadlocks, but matching deadlocks to root causes is non-trivial
and a number of distinct deadlock cases have been fixed in recent years).
The benchmark data is inconclusive about whether it is better to fsync or
not to fsync. A paranoia option might be useful here.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>