fiewalk and fiemap depend on a lot of crucible, and incremental builds
fail hard without proper dependency tracking.
All binaries must be rebuilt when makeflags changes. This dependency
exists already in lib and test, but src was missing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Now that tempfiles are using pool checkin functions to control their
size, we don't need a size limit in realign().
We keep the limit in make_copy because it's a sanity check against
letting a multi-terabyte copy operation slip through.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Get rid of the thread-local TempFiles and use Pool instead. This
eliminates a potential FD leak when the loadavg governor repeatedly
creates and destroys threads.
With the old per-thread TempFiles, we were guaranteed to have exclusive
ownership of the TempFile object within the current thread. Pool is
somewhat stricter: it only guarantees ownership while the checked-out
Handle exists. Adjust the users of TempFile objects to ensure they hold
the Handle object until they are finished using the TempFile.
It appears that maintaining large, heavily-reflinked, long-lived temporary
files costs more than truncating after every use: btrfs has to write
multiple references to the temporary file's extents, then some commits
later, remove references as the temporary file is deleted or truncated.
Using the temporary file in a dedupe operation flushes the data to disk,
so nothing is saved by pretending that there is writeback pipelining and
trying to avoid flushes in truncate. Pool provides usage tracking and
a checkin callback, so use it to truncate the temporary file immediately
after every use.
Redesign TempFile so that every instance creates exactly one Fd which
persists over the lifetime of the TempFile object. Provide a reset()
method which resets the file back to the initial state and call it from
the Pool checkin callback. This makes TempFile's lifetime equivalent to
its Fd's lifetime, which simplifies interactions with FdCache and Roots.
This change means we can now blacklist temporary files without having
an effective memory leak, so do that. We also have a reason to ever
remove something from the blacklist, so add a method for that too.
In order to move to extent-centric addressing, we need to be able to
reliably open temporary files by root and inode number. Previously we
would place TempFile fd's into the cache with insert_root_ino, but the
cache would be cleared periodically, and it would not be possible to
reopen temporary files after that happened. Now that the TempFile's
lifetime is the same as the TempFile Fd's lifetime, we can have TempFile
manage a separate FileId -> Fd map in Roots which is unaffected by the
periodic cache clearing. BeesRoots::open_root_ino_nocache will check
this map before attempting to open the file via btrfs root+ino lookup,
and return it through the cache as if Roots had opened the file via btrfs.
Hold a reference to BeesRoots in BeesTempFile because the usual way
to get such a reference now throws an exception in BeesTempFile's
destructor.
These changes make method BeesTempFile::create() and all methods named
insert_root_ino unnecessary, so delete them.
We construct and destroy TempFiles much less often now, so make their
constructor and destructor more informative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A prealloc extent reference can be deduped immediately and asynchronously.
There is no need to slow down extent scanning to do it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
I was never able to prove a connection between fsync() and deadlock bugs.
There were too many deadlock bugs to be able to isolate a bug that is
triggered specifically by fsync.
Update the comment (which has been unchanged since kernel 4.14). We still
may want to do fsync() on temporary files someday, but there's a full
internal API rewrite between here and there.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Number of items should be low enough that we don't have too many stale
items, but high enough to amortize system call overhead to a reasonable
ratio.
Number of bytes should be constant: one worst-case metadata page (the
btrfs limit is 64K, though 16K is much more common) so that we always
have enough space for one worst-case item; otherwise, we get EOVERFLOW
if we set the number of items too low and there's a big item in the tree,
and we can't make further progress.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are lots of ways the search can fail, but it's hard to pick one
without knowing the parameters.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It's a pain to read, edit, and format large blocks of text in C++ code,
so rip the usage message out of bees.cc and put it in a plain text file.
Use a minimal translator to convert it into a C string.
While we're here, remove the multiple roots feature from the command
line synopsis, as we don't really support it any more. Also clarify
that "id 5" is "subvol id 5", and describe in one sentence what
workaround-btrfs-send does.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Remove unused function getenv_or_die. All of our environment variable
parameters are optional or have default values.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A long time ago, when bees used dedicated threads to scan each subvol, the
calculation of the "dedup_unique_bytes" statistic was still wrong.
This stat can only be calculated when dedupe runs on extent data items
instead of extent reference items. Remove the stat variable until
that happens.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
uncaught_exception() had only the one valid use case, and it can be
reimplemented by literally calling current_exception() instead.
current_exception() has several valid use cases, so it is not likely
to be deprecated any time soon.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We cannot use BeesContext::roots() until after
BeesContext::set_root_path() has been called.
Save up the parameter settings until then.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This avoids some kernel bugs. One of them is fixed in 5.3.4 and later:
efad8a853a "Btrfs: fix use-after-free when using the tree modification log"
There are apparently others in current kernels, so for now just put bees
on pause until the balance is done.
At some point we may want to provide an option to disable this
workaround; however, running bees and balance at the same time makes
neither particularly fast, so maybe we'll just leave it this way.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Saying just "This feature" at some log levels could be puzzling. Let's
remove this message, the feature works without problems for a year.
Signed-off-by: Kai Krakow <kai@kaishome.de>
In version 2.30 glibc added it's own gettid() function. This resulted in
"error: call of overloaded ‘gettid()’ is ambiguous" because gettid()
now exists in both namespace crucible and std.
For now, use explicit references to namespace crucible. This continues
to work with new and old libc without having to test specific library
versions.
At some point, glibc gettid() will be deployed widely enough that we can
remove the crucible version entirely.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Localize the hash function in bees to a single spot to make it easier
to change later (or at runtime).
Remove some code that was using a property of CRC as an optimization.
The optimization doesn't work for other hash functions, and running the
CRC function takes more CPU time than the optimization saved.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some build environments (ARM? AARCH64?) do not have the fields
si_lower and si_upper in siginfo.
bees doesn't need them, so don't try to access them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This is especially useful when dynamic load management allocates more
worker threads than active tasks, so the extra threads are effectively
invisible.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce a mechanism to suppress exceptions which do not produce a
full stack trace for common known cases where a loop should be aborted.
Use this mechanism to suppress the infamous "FIXME" exception.
Reduce the log level to at most NOTICE, and in some cases DEBUG.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
libcrucible at one time in the distant past had to be a shared library
to force global C++ object initialization; however, this is no longer
required.
Make libcrucible static to solve various rpath and soname versioning
issues, especially when distros try (unwisely) to package the library
separately.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We stopped supporting shared hash tables a long time ago. Remove comments
describing the behavior of shared hash tables.
Add an event counter for pushing a hash to the front when it is already at
the front.
Audited the code for a bug related to bucket handling that impairs space
efficiency when the bucket size is greater than 1. Didn't find one.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Capture SIGINT and SIGTERM and shut down, preserving current completed
crawl and hash table state.
* Executing tasks are completed, queued tasks are paused.
* Crawl state is saved.
* The crawl master and crawl writeback threads are terminated.
* The task queue is flushed.
* Dirty hash table extents are flushed.
* Hash prefetch and writeback threads are terminated.
* Hash table is deallocated.
* FD caches and tmpfiles are destroyed.
* Assuming the above didn't crash or deadlock, bees exits.
The above order isn't the fastest, but it does roughly follow the
shared_ptr dependencies and avoids data races--especially those that
might lead to bees reporting an extent scanned when it was only queued
for future scanning that did not occur.
In case of a violation of expected shared_ptr dependency order,
exceptions in BeesContext child object accessor methods (i.e. roots(),
hash_table(), etc) prevent any further progress in threads that somehow
remain unexpectedly active.
Move some threads from main into BeesContext so they can be stopped
via BeesContext. The main thread now runs a loop waiting for signals.
A slow FD leak was discovered in TempFile handling. This has not been
fixed yet, but an implementation detail of the C++ runtime library makes
the leak so slow it may never be important enough to fix.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The deadlock seems to be fixed now (if there ever was one--there certainly
were deadlocks, but matching deadlocks to root causes is non-trivial
and a number of distinct deadlock cases have been fixed in recent years).
The benchmark data is inconclusive about whether it is better to fsync or
not to fsync. A paranoia option might be useful here.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The crawl_master task had a simple atomic variable that was supposed
to prevent duplicate crawl_master tasks from ending up in the queue;
however, this had a race condition that could lead to m_task_running
being set with no crawl_master task running to clear it. This would in
turn prevent crawl_thread from scheduling any further crawl_master tasks,
and bees would eventually stop doing any more work.
A proper fix is to modify the Task class and its friends such that
Task::run() guarantees that 1) at most one instance of a Task is ever
scheduled or running at any time, and 2) if a Task is scheduled while
an instance of the Task is running, the scheduling is deferred until
after the current instance completes. This is part of a fairly large
planned change set, but it's not ready to push now.
So instead, unconditionally push a new crawl_master Task into the queue
on every poll, then silently and quickly exit if the queue is too full
or the supply of new extents is empty. Drop the scheduling-related
members of BeesRoots as they will not be needed when the proper fix lands.
Fixes: 4f0bc78a "crawl: don't block a Task waiting for new transids"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This sequence of log messages isn't clear:
crawl_master: WORKAROUND: Avoiding RO subvol 6094
crawl_master: WORKAROUND: RO root 6094
The first is from a cache miss, and appears wherever a root is opened
(dedupe or crawl). The second is skipping an entire subvol scan, and
only happens in crawl_master.
Elaborate on the second message a little.
Also use the term "root" consistently when referring to subvol tree IDs.
btrfs refers to these objects by (at least) three distinct names: tree,
subvol, and root. Using three different words for the same thing is worse
than using a single wrong word consistently to refer to the same concept.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
After weeks of testing I copied part of a change to main without copying
the rest of the change, leading to an immediate segfault on startup.
So here is the rest of the change: limit the number of
BeesContexts per process to 1. This change was discussed at
https://github.com/Zygo/bees/issues/54#issuecomment-360332529 but there
are more reasons to do it now: the candidates to replace the current
hash table format are less forgiving of sharing hash tables, and it may
even become necessary to have more than one hash table per BeesContext
instance (e.g. to keep datasum and nodatasum data separate).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
https://github.com/Zygo/bees/issues/91 describes problems encountered
when running bees on systems with many CPU cores.
Limit the computed number of threads (using --thread-factor or the
default) to a maximum of 8 (i.e. the number of logical cores in a modern
laptop). Users can override the limit by using --thread-count.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce --workaround options which trade performance or effectiveness to
avoid triggering kernel bugs.
The first such option is --workaround-btrfs-send, which avoids making any
modification to read-only subvols to avoid btrfs send bugs.
Clean up usage message: no tabs for formatting, split options into
sections by theme.
Make scan mode a non-static data member like all (most?) other options.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The log message is quite CPU-intensive to generate, and some data sets
have enough hash collisions to throw off benchmarks.
Keep the event counter but drop the log message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit brings back -O3 but in an overridable way. This should make
downstream distributions happy enough to accept it.
While at the subject, let's apply the same fixup logic to LDFLAGS, too.
This commit also properly gets rid of the implicit rules which collided
too easily with the depends.mk.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Better toxic extent detection means we can now handle extents with
many more references--easily hundreds of thousands.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Faster and more reliable toxic extent detection means we can now be much
less paranoid about creating toxic extents.
The paranoia has significant impact on dedupe hit rates because every
extent that contains even one toxic hash is abandoned. The preloaded
toxic hashes were chosen because they occur more frequently than any
other block contents in typical filesystem data. The combination of these
resulted in as much as 30% of duplicate extents being left untouched.
Remove the preloaded toxic extent blacklist, and rely on the new
kernel-CPU-usage-based workaround instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We detect toxic extents by measuring how long the LOGICAL_INO ioctl takes
to run. If it is above some threshold, we consider the extent toxic,
and blacklist it; otherwise, we process the extent normally.
The detector was using the execution time of the ioctl, which detects
toxic extents, but it also detects pauses of the bees process and
transaction commit latency due to load. This leads to a significant
number of false positives. The detection threshold was also very long,
burning a lot of kernel CPU before the detection was triggered.
Use the per-thread system CPU statistics to measure the kernel CPU usage
of the LOGICAL_INO call directly. This is much more reliable because it
is not confounded by other threads, and it's faster because we can set
the time threshold two orders of magnitude lower.
Also remove the lock and mutex added in "context: serialize LOGICAL_INO
calls" because we theoretically no longer need it (but leave the code
there with #if 0 in case we do need it in practice).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
ROOT_TREE contains the ROOT_ITEM for EXTENT_TREE. Every modification
(that we care about) to a btrfs must go through EXTENT_TREE, and must
modify the page in ROOT_TREE pointing to the root of EXTENT_TREE...
which makes that a very good source for the filesystem transid.
Remove the loop and the root lookups, and just look at one item for
max_transid.
Also note that every caller of transid_max_nocache() immediately
feeds the return value to m_transid_re.update(), so don't do that
inside transid_max_nocache().
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It turns out that we do need to scan all the subvols in order
to find transid_max.
Keep the bug fix though.
This reverts commit bf6ae80eeec6afcbee505d22af8e62f60dc1c9a6.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesRoots::transid_max_nocache calls btrfs_get_root_transid() which
retrieves the transid of the root of the given Fd. Since the FS_TREE
(subvol 5) is the root of the subvol hierarchy, it will always have
the highest transid on the filesystem, and we do not need to look at
any others.
Also fix a bug where we pass BTRFS_FS_TREE_OBJECTID instead of the
file descriptor root_fd() to btrfs_get_root_transid(). If BEESHOME
is somewhere on the same btrfs filesystem, and there are no leaked FDs
at bees startup, then BTRFS_FS_TREE_OBJECTID (5) usually has the same
integer value as a valid file descriptor of some object on the filesystem
that has a regularly increasing transid value. If Fd 5 happens to be a
file in BEESHOME then bees itself drives the transid increments. This,
combined with the search of all subvol roots, hides the bug (unless Fd
5 gets closed somehow).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesContext::home_fd() is supposed to open $BEESHOME once and cache
the Fd for later calls; however, instead it was reopening a new Fd each
time it was called, and _also_ holding that Fd in a BeesContext member.
Fds clean themselves up when they are forgotten, so it was not leaking
per se, but it certainly had more open Fds than it needed to.
Check to see if we have m_home_fd open, and return that if so.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
LOGICAL_INO can trip over the btrfs slow-backrefs bug, resulting in
some very long in-kernel runtimes. If too many threads are executing
LOGICAL_INO then there may be no cores left on the system to run other
tasks.
Toxic extent detection is done by a very rudimentary algorithm which
can be confused by unrelated sources of latency within btrfs (especially
commit latency). The algorithm can also be confused by other threads
executing the LOGICAL_INO ioctl.
These are two good reasons to prevent any two threads in a single bees
process instance from executing LOGICAL_INO at the same time, so let's
do that.
It is possible to limit the number of threads executing LOGICAL_INO with
the -c and -C options; however, this also limits the number of threads
which can perform any operation, while only LOGICAL_INO (*) has such a
profound effect on the rest of system operation.
Also make the status message clearer about exactly when LOGICAL_INO is
executed, as opposed to merely waiting to acquire a lock before executing
the ioctl.
(*) or maybe FILE_EXTENT_SAME. The problem function that keeps showing
up in kernel stack traces is find_parent_nodes, which is called by both
the LOGICAL_INO and FILE_EXTENT_SAME ioctls. We'll try this change
first and see if it prevents any recurrences of forced watchdog reboots;
if it does not, then we'll limit FILE_EXTENT_SAME the same way.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The ordering function for BeesCrawlState did not consider
root 292 inode 0 min_transid 2345 max_transid 3456
to be larger than
root 292 inode 258 min_transid 2345 max_transid 2345
so when we attempted to update the end pointer for the crawl progress,
the new state was not considered newer than the old state because the
min_transid was equal, but the new crawl state's inode number was smaller.
Normally this is not a problem because subvol scans typically begin
and end in separate transactions (in part because we don't start a
subvol scan until at least two transactions are available); however,
the cleanup code for the aftermath of the recent transid_min() bug can
create crawlers with equal max_transid and min_transid records.
Fix this by ordering both transid fields before any others in the
crawl state.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Due to an earlier bug some beescrawl.dat files will contain uint64_t
max as max_transid. This prevents any further scanning on the subvol
because there is no possibiity of having a real transid (or any other
uint64_t number) larger than uint64_t max.
If we detect a bad transid in beescrawl.dat, log a warning, then use
some more plausible value: either min_transid to repeat the previous
incremental crawl, or 0 to restart the subvol scan from the beginning.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On a few test machines max_transid on subvols is getting set to
18446744073709551615 (aka uint64_t max).
Prevent transid_min() from ever returning this value.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"saved" is used only during hash table correctness analysis, which is
normally not enabled at compile time, and requires source modification
to enable.
Remove the pointless copy and save a tiny bit of CPU.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>