There seem to be multiple ways to do readahead in Linux, and only some
of them work. Hopefully reading the actual data is one of them.
This is an attempt to avoid page-by-page reads in the generic dedupe code.
We load both extents into the VFS cache (read sequentially) and hope they
are still there by the time we call dedupe on them.
We also call readahead(2) and hopefully that either helps or does nothing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Change documentation and comments to use the word "dedupe," not "dedup"
as found in circa-3.15 kernel sources.
No changes in code or program output--if they used "dedup" before, they
will continue to be spelled "dedup" now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Higher CPU core counts became more common, and kernel bugs became less
common, since the arbitrary 8-thread limit was introduced. We can remove
the limit now, and treat any remaining scaling inefficiency as a bug to
be removed.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Support for multiple BeesContext objects sharing a FdCache was wasting
significant space and atomic inc/dec memory cycles for no good reason
since the shared-FdCache feature was deprecated.
open_root and open_root_ino still need a BeesContext to work. Pass the
BeesContext pointer through the function object instead of the cache
key arguments.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
pthread_cancel doesn't really work properly. It was only being used in
bees to bring threads to a stop if the BeesContext is destroyed early.
It is frequently implicated in core dump reports because of the fragility
of the C++ iostream / C stdio / library infrastructure, particularly
surrounding upgrades on the host running bees. The pthread_cancel call
itself often simply fails even when it doesn't call terminate().
Defer creation of the status and progress threads until after the
BeesContext::start method is invoked. At that point, the existing
ask-threads-nicely-to-stop code is up and running, and normal condvars
can be used to bring bees to a stop, without having to resort to
pthread_cancel.
Since we're deleting half of the BeesContext constructor in this change,
let's remove the other half too, and put an end to the deprecated support
for multiple BeesContexts sharing a process. It's still possible to run
multiple BeesContexts, but they will not share a FD cache. This will
allow the FD cache's keys to become smaller and hopefully save some
memory later on.
Fixes: #171
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The weird things distros do to the path where uuid.h gets installed
have broken bees builds for the last time.
We were only using uuid to support a legacy feature that was removed
over four years ago.
Hypothetical users who are upgrading directly from bees v0.1 should
probably restart all the crawlers anyway--there were bugs. Also, if any
such users exist, I respect their tremendous patience with the horrible
performance all these years--bees got about 30x faster since v0.1.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make these workarounds configurable in src/bees.h instead of #if 0
code blocks. Someday we'll make the constants in bees.h configurable
through a file or similar.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Get rid of the thread-local TempFiles and use Pool instead. This
eliminates a potential FD leak when the loadavg governor repeatedly
creates and destroys threads.
With the old per-thread TempFiles, we were guaranteed to have exclusive
ownership of the TempFile object within the current thread. Pool is
somewhat stricter: it only guarantees ownership while the checked-out
Handle exists. Adjust the users of TempFile objects to ensure they hold
the Handle object until they are finished using the TempFile.
It appears that maintaining large, heavily-reflinked, long-lived temporary
files costs more than truncating after every use: btrfs has to write
multiple references to the temporary file's extents, then some commits
later, remove references as the temporary file is deleted or truncated.
Using the temporary file in a dedupe operation flushes the data to disk,
so nothing is saved by pretending that there is writeback pipelining and
trying to avoid flushes in truncate. Pool provides usage tracking and
a checkin callback, so use it to truncate the temporary file immediately
after every use.
Redesign TempFile so that every instance creates exactly one Fd which
persists over the lifetime of the TempFile object. Provide a reset()
method which resets the file back to the initial state and call it from
the Pool checkin callback. This makes TempFile's lifetime equivalent to
its Fd's lifetime, which simplifies interactions with FdCache and Roots.
This change means we can now blacklist temporary files without having
an effective memory leak, so do that. We also have a reason to ever
remove something from the blacklist, so add a method for that too.
In order to move to extent-centric addressing, we need to be able to
reliably open temporary files by root and inode number. Previously we
would place TempFile fd's into the cache with insert_root_ino, but the
cache would be cleared periodically, and it would not be possible to
reopen temporary files after that happened. Now that the TempFile's
lifetime is the same as the TempFile Fd's lifetime, we can have TempFile
manage a separate FileId -> Fd map in Roots which is unaffected by the
periodic cache clearing. BeesRoots::open_root_ino_nocache will check
this map before attempting to open the file via btrfs root+ino lookup,
and return it through the cache as if Roots had opened the file via btrfs.
Hold a reference to BeesRoots in BeesTempFile because the usual way
to get such a reference now throws an exception in BeesTempFile's
destructor.
These changes make method BeesTempFile::create() and all methods named
insert_root_ino unnecessary, so delete them.
We construct and destroy TempFiles much less often now, so make their
constructor and destructor more informative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Number of items should be low enough that we don't have too many stale
items, but high enough to amortize system call overhead to a reasonable
ratio.
Number of bytes should be constant: one worst-case metadata page (the
btrfs limit is 64K, though 16K is much more common) so that we always
have enough space for one worst-case item; otherwise, we get EOVERFLOW
if we set the number of items too low and there's a big item in the tree,
and we can't make further progress.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It's a pain to read, edit, and format large blocks of text in C++ code,
so rip the usage message out of bees.cc and put it in a plain text file.
Use a minimal translator to convert it into a C string.
While we're here, remove the multiple roots feature from the command
line synopsis, as we don't really support it any more. Also clarify
that "id 5" is "subvol id 5", and describe in one sentence what
workaround-btrfs-send does.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This avoids some kernel bugs. One of them is fixed in 5.3.4 and later:
efad8a853a "Btrfs: fix use-after-free when using the tree modification log"
There are apparently others in current kernels, so for now just put bees
on pause until the balance is done.
At some point we may want to provide an option to disable this
workaround; however, running bees and balance at the same time makes
neither particularly fast, so maybe we'll just leave it this way.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Localize the hash function in bees to a single spot to make it easier
to change later (or at runtime).
Remove some code that was using a property of CRC as an optimization.
The optimization doesn't work for other hash functions, and running the
CRC function takes more CPU time than the optimization saved.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce a mechanism to suppress exceptions which do not produce a
full stack trace for common known cases where a loop should be aborted.
Use this mechanism to suppress the infamous "FIXME" exception.
Reduce the log level to at most NOTICE, and in some cases DEBUG.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Capture SIGINT and SIGTERM and shut down, preserving current completed
crawl and hash table state.
* Executing tasks are completed, queued tasks are paused.
* Crawl state is saved.
* The crawl master and crawl writeback threads are terminated.
* The task queue is flushed.
* Dirty hash table extents are flushed.
* Hash prefetch and writeback threads are terminated.
* Hash table is deallocated.
* FD caches and tmpfiles are destroyed.
* Assuming the above didn't crash or deadlock, bees exits.
The above order isn't the fastest, but it does roughly follow the
shared_ptr dependencies and avoids data races--especially those that
might lead to bees reporting an extent scanned when it was only queued
for future scanning that did not occur.
In case of a violation of expected shared_ptr dependency order,
exceptions in BeesContext child object accessor methods (i.e. roots(),
hash_table(), etc) prevent any further progress in threads that somehow
remain unexpectedly active.
Move some threads from main into BeesContext so they can be stopped
via BeesContext. The main thread now runs a loop waiting for signals.
A slow FD leak was discovered in TempFile handling. This has not been
fixed yet, but an implementation detail of the C++ runtime library makes
the leak so slow it may never be important enough to fix.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The crawl_master task had a simple atomic variable that was supposed
to prevent duplicate crawl_master tasks from ending up in the queue;
however, this had a race condition that could lead to m_task_running
being set with no crawl_master task running to clear it. This would in
turn prevent crawl_thread from scheduling any further crawl_master tasks,
and bees would eventually stop doing any more work.
A proper fix is to modify the Task class and its friends such that
Task::run() guarantees that 1) at most one instance of a Task is ever
scheduled or running at any time, and 2) if a Task is scheduled while
an instance of the Task is running, the scheduling is deferred until
after the current instance completes. This is part of a fairly large
planned change set, but it's not ready to push now.
So instead, unconditionally push a new crawl_master Task into the queue
on every poll, then silently and quickly exit if the queue is too full
or the supply of new extents is empty. Drop the scheduling-related
members of BeesRoots as they will not be needed when the proper fix lands.
Fixes: 4f0bc78a "crawl: don't block a Task waiting for new transids"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
https://github.com/Zygo/bees/issues/91 describes problems encountered
when running bees on systems with many CPU cores.
Limit the computed number of threads (using --thread-factor or the
default) to a maximum of 8 (i.e. the number of logical cores in a modern
laptop). Users can override the limit by using --thread-count.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce --workaround options which trade performance or effectiveness to
avoid triggering kernel bugs.
The first such option is --workaround-btrfs-send, which avoids making any
modification to read-only subvols to avoid btrfs send bugs.
Clean up usage message: no tabs for formatting, split options into
sections by theme.
Make scan mode a non-static data member like all (most?) other options.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Better toxic extent detection means we can now handle extents with
many more references--easily hundreds of thousands.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Faster and more reliable toxic extent detection means we can now be much
less paranoid about creating toxic extents.
The paranoia has significant impact on dedupe hit rates because every
extent that contains even one toxic hash is abandoned. The preloaded
toxic hashes were chosen because they occur more frequently than any
other block contents in typical filesystem data. The combination of these
resulted in as much as 30% of duplicate extents being left untouched.
Remove the preloaded toxic extent blacklist, and rely on the new
kernel-CPU-usage-based workaround instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We detect toxic extents by measuring how long the LOGICAL_INO ioctl takes
to run. If it is above some threshold, we consider the extent toxic,
and blacklist it; otherwise, we process the extent normally.
The detector was using the execution time of the ioctl, which detects
toxic extents, but it also detects pauses of the bees process and
transaction commit latency due to load. This leads to a significant
number of false positives. The detection threshold was also very long,
burning a lot of kernel CPU before the detection was triggered.
Use the per-thread system CPU statistics to measure the kernel CPU usage
of the LOGICAL_INO call directly. This is much more reliable because it
is not confounded by other threads, and it's faster because we can set
the time threshold two orders of magnitude lower.
Also remove the lock and mutex added in "context: serialize LOGICAL_INO
calls" because we theoretically no longer need it (but leave the code
there with #if 0 in case we do need it in practice).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The 16MB hash table extent size did not serve any useful defragmentation
or compression purpose, and for very small filesystems (under 100GB),
16MB is much larger than necessary.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Log messages were already labelled with log levels, but there was no
way to filter by log level at run time.
Implement the filter inside the bees process so it can skip evaluation
of the BEESLOG* arguments if the log messages would not be emitted.
Fixes: https://github.com/Zygo/bees/issues/67
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Linux kernel 4.14, while resistant to extent toxicity, is not immune to it.
Go back to the paranoid setting to avoid tying up filesystems in
ridiculously long kernel loops in find_parent_nodes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The task queue can become very large with many subvols, requiring hours
for the queue to clear. 'beescrawl.dat' saves in the meantime will save
the work currently scheduled, not the work currently completed.
Fix by tracking progress with ProgressTracker. ProgressTracker::begin()
gives the last completed crawl position. ProgressTracker::end() gives
the last scheduled crawl position. begin() does not advance if there
is any item between begin() and end() is not yet completed. In between
are crawled extents that are on the task queue but not yet processed.
The file 'beescrawl.dat' saves the begin() position while the extent
scanning task queue is fed from the end() position.
Also remove an unused method crawl_state_get() and repurpose the
operator<(BeesCrawlState) that nobody was using.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task should not block for extended periods of time.
Remove the RateEstimator::wait_for() in crawl_roots. When crawl_roots
runs out of data, let the last crawl_task end without rescheduling.
Schedule crawl_task again on transid polls if it was not already running.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESLOGNOTE was intended to combine BEESLOG and BEESNOTE, i.e. write a
log message and set the task status message from a single expression.
With the log levels we would now need several more variants
(BEESLOGNOTEDEBUG, BEESLOGNOTEERR...) or a parameter (BEESNOTELOG(DEBUG,
...)).
Or we give up on the idea. This combination was used only 3 times so far.
The log messages and the note message have different editorial styles.
Remove the three instances of BEESLOGNOTE, and make the BEESLOGNOTE
definition equvalent to BEESLOG at LOG_NOTICE level for consistency.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a third scan mode with alternative trade-offs.
Benefits: Good sequential read performance. Avoids race conditions
described in https://github.com/Zygo/bees/issues/27. Avoids diverting
scan resources into short-lived snapshots before their long-lived
origin subvols are fully scanned.
Drawbacks: Takes the longest time of the three implemented scan-modes
to free space in extents that are shared between snapshots. Uses the
maximum amount of temporary space.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Duplicated code between the different scan modes has slowly been
becoming less and less trivial. Move the code to a method and
make both scan-modes call it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restartng scans for each transid is a bit aggressive. Scan every 10
transids for a polling rate close to the former BEES_COMMIT_INTERVAL.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
transid_max is now measured at a single point in the crawl_transid thread.
Move the Crawl deferred logic into BeesRoots so it restarts all crawls
when transid_max increases. Gets rid of some messy time arithmetic.
Change name of Crawl thread to "crawl_master" in both thread name and
log messages.
Replace "Next transid" with "Crawl started".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The periodic cache age check was not protected by a lock, so multiple
threads may decide to concurrently clear the cache. This led to
duplicate log messages.
Fix by moving the cache expiry trigger out of FdCache and into Roots,
which knows when transids change and can perform cache clears at exactly
the time they are most relevant, i.e. after something that was deleted
becomes permanently so.
This removes the last references to BEES_COMMIT_INTERVAL, so get rid
of its definition too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the crawl polling interval more closely track the commit interval
on the btrfs filesystem. In the future this will provide opportunities
to do things like clear FD caches and stop crawls on deleted subvols,
but triggered by transaction commits instead of arbitrary time intervals.
Rename the "crawl" thread so it no longer has the same name as the "crawl"
task, and repurpose it for dedicated transid polling. Cancel the deletion
of crawl_thread and repurpose it to trigger new crawls and wake up the
main crawl Task when it runs out of data.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Having too many "write a message to the log" primitives is confusing,
and having one that intermittently and silently discards output is even
_more_ confusing.
Replace all BEESINFO with appropriate BEESLOG*s. Usually DEBUG.
Except for one or two that occur too often. Just delete those.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit adds log levels to the output. In systemd, it makes colored
lines, otherwise it's probably just a number. Bees is very chatty, so
this paves the road for log level filtering.
Signed-off-by: Kai Krakow <kai@kaishome.de>
There are two subvol scan algorithms implemented so far. The two modes
are unimaginatively named 0 and 1.
0: sorts extents by (inode, subvol, offset),
1: scans extents round-robin from all subvols.
Algorithm 0 scans references to the same extent at close to the same
time, which is good for performance; however, whenever a snapshot is
created, the scan of the entire filesystem restarts at the beginning of
the new snapshot.
Algorithm 1 makes continuous forward progress even when new snapshots
are created, but it does not benefit from caching and will force the
kernel to reread data multiple times when there are snapshots.
The algorithm can be selected at run-time using the -m or --scan-mode
option.
We can collect some field data on these before replacing them with
an extent-tree-based scanner. Alternatively, for pre-4.14 kernels,
we can keep these two modes as non-default options.
Currently these algorithms have terrible names. TODO: fix that, but
also TODO: delete all that code and do scans directly from the extent
tree instead.
Augment the scan algorithms relative to their earlier implementation by
batching multiple extents to scan from each subvol before switching to
a different subvol.
Sprinkle some BEESNOTEs on the Task objects so that they don't
disappear from the thread status output.
Adjust some timing constants to deal with the increased latency from
competing threads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Distribute incoming extents across a thread pool for faster execution
on multi-core, multi-disk environments.
Switch extent enumeration model to scan extent refs consecutively(ish).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With kernel 4.14 there is no sign of the previous LOGICAL_INO performance
problems, so there seems to be no need to throttle threads using this
ioctl.
Increase the FD cache size limits and scan thread count. Let the kernel
figure out scheduling.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This avoids PERFORMANCE warnings when large hash tables are used on slow
CPUs or with lots of worker threads. It also simplifies the code (no
locksets, only one object-wide mutex instead of two).
Fixed a few minor bugs along the way (e.g. we were not setting the dirty
flag on the right hash table extent when we detected hash table errors).
Simplified error handling: IO errors on the hash table are ignored,
instead of throwing an exception into the function that tried to use the
hash table.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BLOCK_SIZE_MIN_EXTENT_DEFRAG, BLOCK_SIZE_MIN_EXTENT_SPLIT, and others
are no longer used. Remove them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit a3d7032edaf5fc584412d0dcf8773f1cafa8f2dc)
This lets us use more default constructors.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 8a932a632ff4602a0357ed5fbcd3f86b6bc50283)
This will allow the default size limit for cache objects to be changed
with impunity.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 9daa51edaab44c02ce0917ff94b20683036d7594)
Holding file FDs open for long periods of time delays inode destruction.
For very large files this can lead to excessive delays while bees dedups
data that will cease to be reachable.
Use the same workaround for file FDs (in the root_ino cache) that
is used for subvols (in the root cache): forcibly close all cached
FDs at regular intervals. The FD cache will reacquire FDs from files
that still have existing paths, and will abandon FDs from files that
no longer have existing paths. The non-existing-path case is not new
(bees has always been able to discover deleted inodes) so it is already
handled by existing code.
Fixes: https://github.com/Zygo/bees/issues/18
Signed-off-by: Zygo Blaxell <bees@furryterror.org>