BeesContext::home_fd() is supposed to open $BEESHOME once and cache
the Fd for later calls; however, it was instead opening a new Fd on
each call, and _also_ holding that Fd in a BeesContext member.
Fds clean themselves up when they are forgotten, so it was not leaking
per se, but it certainly kept more Fds open than it needed to.
Check to see if we have m_home_fd open, and return that if so.
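A minimal sketch of the intended pattern, assuming Fd tests false until
it holds an open descriptor (FLAGS_OPEN_DIR, the default path, and the
openat() wiring are illustrative assumptions, not the exact bees code):

    Fd
    BeesContext::home_fd()
    {
        // Fast path: return the cached Fd if we already opened it
        if (!!m_home_fd) {
            return m_home_fd;
        }
        // Slow path: open $BEESHOME once and cache it in the member
        const char *base_dir = getenv("BEESHOME");
        if (!base_dir) {
            base_dir = ".beeshome";    // assumed default
        }
        m_home_fd = Fd(openat(root_fd(), base_dir, FLAGS_OPEN_DIR));
        return m_home_fd;
    }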
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
LOGICAL_INO can trip over the btrfs slow-backrefs bug, resulting in
some very long in-kernel runtimes. If too many threads are executing
LOGICAL_INO then there may be no cores left on the system to run other
tasks.
Toxic extent detection is done by a very rudimentary algorithm which
can be confused by unrelated sources of latency within btrfs (especially
commit latency). The algorithm can also be confused by other threads
executing the LOGICAL_INO ioctl.
These are two good reasons to prevent any two threads in a single bees
process instance from executing LOGICAL_INO at the same time, so let's
do that.
It is possible to limit the number of threads executing LOGICAL_INO with
the -c and -C options; however, this also limits the number of threads
which can perform any operation, while only LOGICAL_INO (*) has such a
profound effect on the rest of system operation.
Also make the status message clearer about exactly when LOGICAL_INO is
executed, as opposed to merely waiting to acquire a lock before executing
the ioctl.
(*) or maybe FILE_EXTENT_SAME. The problem function that keeps showing
up in kernel stack traces is find_parent_nodes, which is called by both
the LOGICAL_INO and FILE_EXTENT_SAME ioctls. We'll try this change
first and see if it prevents any recurrences of forced watchdog reboots;
if it does not, then we'll limit FILE_EXTENT_SAME the same way.
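A minimal sketch of the serialization, using one process-wide mutex
around the ioctl (the wrapper and status comments are illustrative,
not the actual bees code):

    #include <mutex>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    static std::mutex s_logical_ino_mutex;

    int do_logical_ino(int fd, struct btrfs_ioctl_logical_ino_args *args)
    {
        // Status here: "waiting to acquire LOGICAL_INO lock"
        std::unique_lock<std::mutex> lock(s_logical_ino_mutex);
        // Status here: "executing LOGICAL_INO"; at most one thread
        // per process is inside the ioctl at any time
        return ioctl(fd, BTRFS_IOC_LOGICAL_INO, args);
    }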
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The ordering function for BeesCrawlState did not consider
    root 292 inode 0 min_transid 2345 max_transid 3456
to be larger than
    root 292 inode 258 min_transid 2345 max_transid 2345
so when we attempted to update the end pointer for the crawl progress,
the new state was not considered newer than the old state because the
min_transid was equal, but the new crawl state's inode number was smaller.
Normally this is not a problem because subvol scans typically begin
and end in separate transactions (in part because we don't start a
subvol scan until at least two transactions are available); however,
the cleanup code for the aftermath of the recent transid_min() bug can
create crawlers with equal max_transid and min_transid records.
Fix this by ordering both transid fields before any others in the
crawl state.
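A sketch of the fixed ordering using std::tie, with both transid fields
compared before anything else (member names are assumptions based on
the fields quoted above):

    #include <tuple>

    bool
    BeesCrawlState::operator<(const BeesCrawlState &that) const
    {
        // min_transid and max_transid dominate the ordering; root,
        // inode, and offset only break ties within one transid range
        return std::tie(m_min_transid, m_max_transid,
                        m_objectid, m_offset, m_root)
             < std::tie(that.m_min_transid, that.m_max_transid,
                        that.m_objectid, that.m_offset, that.m_root);
    }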
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Due to an earlier bug some beescrawl.dat files will contain uint64_t
max as max_transid. This prevents any further scanning on the subvol
because there is no possibility of having a real transid (or any other
uint64_t number) larger than uint64_t max.
If we detect a bad transid in beescrawl.dat, log a warning, then use
some more plausible value: either min_transid to repeat the previous
incremental crawl, or 0 to restart the subvol scan from the beginning.
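A sketch of the load-time check (the logging text and member names are
illustrative):

    // After parsing one beescrawl.dat record into 'loaded':
    if (loaded.m_max_transid == std::numeric_limits<uint64_t>::max()) {
        BEESLOGWARN("WORKAROUND: bogus max_transid in beescrawl.dat "
                    "for root " << loaded.m_root);
        if (loaded.m_min_transid != std::numeric_limits<uint64_t>::max()) {
            // Repeat the previous incremental crawl
            loaded.m_max_transid = loaded.m_min_transid;
        } else {
            // min_transid is bogus too: restart the subvol scan
            loaded.m_min_transid = 0;
            loaded.m_max_transid = 0;
        }
    }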
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On a few test machines max_transid on subvols is getting set to
18446744073709551615 (aka uint64_t max).
Prevent transid_min() from ever returning this value.
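The likely escape path is the sentinel value used while scanning for
the minimum; a sketch of the clamp (loop structure and member names
are assumptions):

    uint64_t
    BeesRoots::transid_min()
    {
        uint64_t rv = std::numeric_limits<uint64_t>::max();
        for (const auto &i : m_root_crawl_map) {
            rv = std::min(rv, i.second->get_state().m_min_transid);
        }
        // Never return the sentinel: no crawlers means "start from 0"
        return rv == std::numeric_limits<uint64_t>::max() ? 0 : rv;
    }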
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"saved" is used only during hash table correctness analysis, which is
normally not enabled at compile time, and requires source modification
to enable.
Remove the pointless copy and save a tiny bit of CPU.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The 16MB hash table extent size did not serve any useful defragmentation
or compression purpose, and for very small filesystems (under 100GB),
16MB is much larger than necessary.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When package maintainers build from a tarball, the .git directory does
not exist to extract the version tag. Let's add a hack to work around
this issue and let them specify `BEES_VERSION="v0.y"` on the make
cmdline.
Github-Bug: https://github.com/Zygo/bees/issues/75
Signed-off-by: Kai Krakow <kai@kaishome.de>
The -g option limits the number of worker threads when the target load
average is exceeded. On some systems the load normally runs high, and
continuous bees operation is required to avoid running out of disk space.
Add a -G/--thread-min option to force at least some threads to continue
running.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The task queue may already be full of tasks when the crawl task is
executed. In this case simply reschedule the crawl task at the
end of the current queue.
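A sketch of the reschedule, with assumed queue-inspection and
self-requeue APIs:

    // Inside the crawl Task's function:
    if (TaskMaster::get_queue_count() > BEES_MAX_QUEUE_SIZE) {
        // Queue is saturated: put this task back at the end of the
        // current queue and yield instead of adding more work
        Task::current_task().run();
        return;
    }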
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add -g / --loadavg-target parameter to track system load and add or
remove bees worker threads dynamically to keep system load close to the
loadavg target. Thread count may vary from zero to the maximum
specified by -c or -C, and is adjusted every 5 seconds.
This is better than implementing a similar load average scheme from
outside of the process (though that is still possible) because the
in-process load tracker does not disrupt the performance timing feedback
mechanisms as a freezer cgroup or SIGSTOP would when controlling bees
from outside. The internal load average tracker can also adjust the
number of active threads while an external tracker can only choose from
the maximum or zero.
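A minimal sketch of such a governor, assuming a hypothetical
TaskMaster::set_thread_count() setter (the real adjustment policy may
differ):

    #include <cstdlib>
    #include <unistd.h>

    void load_tracking_loop(double target, size_t thread_max)
    {
        size_t threads = thread_max;
        for (;;) {
            double load = 0;
            if (getloadavg(&load, 1) == 1) {
                // Over target: shed one worker; under target: add one
                if (load > target && threads > 0) --threads;
                if (load < target && threads < thread_max) ++threads;
                TaskMaster::set_thread_count(threads);
            }
            sleep(5);   // adjust every 5 seconds, as described above
        }
    }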
Also fix a bug where a Task could deadlock waiting for itself to exit
if it tries to insert a new Task after the number of worker threads has
been set to zero.
Also correct the usage message for --scan-mode (values are 0..2) since
we are touching adjacent lines anyway.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Other btrfs utils use readahead(), not posix_fadvise().
There does not appear to be a performance or correctness difference
among the three options (none, posix_fadvise(), or readahead()).
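The change amounts to swapping one readahead hint for another; both
calls below are standard Linux APIs (readahead() requires _GNU_SOURCE
and <fcntl.h>):

    // Before:
    // posix_fadvise(fd, offset, size, POSIX_FADV_WILLNEED);
    // After:
    readahead(fd, offset, size);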
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Log messages were already labelled with log levels, but there was no
way to filter by log level at run time.
Implement the filter inside the bees process so it can skip evaluation
of the BEESLOG* arguments if the log messages would not be emitted.
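A sketch of the filter; because the level test wraps the whole
statement, the stream arguments are never evaluated for suppressed
messages (Chatter is the project's logging class; the threshold
variable name and comparison are assumptions):

    #define BEESLOG(lv, x) do {                        \
        if ((lv) <= bees_log_level) {                  \
            Chatter c(lv, BeesNote::get_name());       \
            c << x;                                    \
        }                                              \
    } while (0)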
Fixes: https://github.com/Zygo/bees/issues/67
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When BEESLOGINFO is called multiple times it generates separate log
records that can be mixed up when multiple threads dedup.
Use a single BEESLOGINFO call for each dedup to prevent this.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
set() was broken and redundant. Calling hold() and discarding the
returned object has the correct effect.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Linux kernel 4.14, while resistant to extent toxicity, is not immune to it.
Go back to the paranoid setting to avoid tying up filesystems in
ridiculously long kernel loops in find_parent_nodes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
An empty BeesBlockData from the chasing algorithm used to mean that data
was found at the expected location but did not match; however, there
are now other reasons for this and they occur much more often. The name
is misleading.
Change the name to report more correctly what happens: no data, without
any guess about the reason.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The task queue can become very large with many subvols, requiring hours
for the queue to clear. Saves of 'beescrawl.dat' in the meantime record
the work currently scheduled, not the work currently completed.
Fix by tracking progress with ProgressTracker. ProgressTracker::begin()
gives the last completed crawl position. ProgressTracker::end() gives
the last scheduled crawl position. begin() does not advance while any
item between begin() and end() is still incomplete. In between
are crawled extents that are on the task queue but not yet processed.
The file 'beescrawl.dat' saves the begin() position while the extent
scanning task queue is fed from the end() position.
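A sketch of how the two positions are used (hold() and the helper
names are assumptions based on the description above):

    ProgressTracker<BeesCrawlState> tracker(loaded_state);

    // Scheduling side: feed the extent scan task queue from end()
    auto next_state = next_crawl_position(tracker.end());
    auto holder = tracker.hold(next_state);    // marks work in flight
    // ...a Task scans the extents, then drops 'holder'; begin()
    // advances only when no earlier hold is still outstanding.

    // Saving side: beescrawl.dat records only completed work
    save_beescrawl(tracker.begin());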
Also remove an unused method crawl_state_get() and repurpose the
operator<(BeesCrawlState) that nobody was using.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When both block candidates for dedup are located in the same extent, bees
excludes them from deduplication because the dedup operation would not
free any space (both blocks are still referenced, so neither is deleted).
Candidates in other extents are still considered.
Typically a few blocks are duplicated many thousands or even millions
of times within a filesystem. Many of these blocks appear in the same
extent as each other. An extremely common duplicate block may appear
multiple times within each of many extents.
bees can get into a loop with a very bad worst-case running time: 32768
blocks per extent * 2560 bees reference limit * 256 distinct hash table
entries = 21.5 *billion* iterations...squared, because this loop happens
every time bees encounters any of the references. Not an infinite
number, but close enough.
In each iteration of the loop, replace_dst detects that both src and dst
block are part of the same btrfs extent data item and therefore should
not be deduped; however, this occurs after the block has been allocated
and read by chase_extent_ref. This dst is discarded, but the outer
loop tries again with another reference to the same block and gets the
same result.
An easy fix for this problem is to stop the loop immediately when the
same physical extent is found in both src and dst. The condition is rare
enough to ignore the negligible space efficiency loss, and the filesystem
scan stops dead if the loop is allowed to proceed. An exception is
thrown to terminate the loop at scan_one_extent from within replace_dst.
It would be better to determine the extent bytenr of each candidate
extent and filter them out in scan_one_extent (which reduces the number
of LOGICAL_INO calls as a side-effect), but bees has no code capable of
doing extent data tree lookups with backward iteration yet. Even better
would be to change the hash table format so that the extent bytenr can
be decoded directly from the hash table entry (this already exists for
compressed extents). Both of these changes are too large for v0.6.
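A sketch of the early exit inside replace_dst (names of the resolved
extent fields and the counter are hypothetical; bees uses its own
exception macros rather than a bare throw):

    // Once both blocks resolve to physical extents:
    if (src_extent_bytenr == dst_extent_bytenr) {
        BEESCOUNT(dedup_same_extent);
        // Unwinds out of the reference loop in scan_one_extent
        throw std::runtime_error("src and dst block in same extent");
    }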
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
One very common case is losing a race to open a file that was deleted.
No need to spam the logs with mere ENOENT reports.
Other errors are more significant. Log those with errno, and
add event counters to record them.
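A sketch of the open-error handling (counter names are hypothetical):

    #include <cerrno>
    #include <cstring>
    #include <fcntl.h>

    int fd = openat(dir_fd, name.c_str(), O_RDONLY);
    if (fd < 0) {
        if (errno == ENOENT) {
            // Lost a race with unlink: count it, don't log it
            BEESCOUNT(open_enoent);
        } else {
            BEESLOGWARN("open " << name << " failed: " << strerror(errno));
            BEESCOUNT(open_error);
        }
    }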
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The previous commit had both max_transid assignments commented out.
It happens to work because we set max_transid in the constructor and
it doesn't change after that, but it's cleaner to assign it explicitly.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When an extent ref is modified, all of the refs in the same metadata
page get the same transid in the TREE_SEARCH_V2 header. This causes
two problems:
- Extents with generation < min_transid are included if they
happen to be referenced by pages with generation >= min_transid.
- Extent refs with generation > max_transid are excluded even
if they reference extents with generation <= max_transid.
Both of these are wrong: the first causes some extents to be repeatedly
scanned, the second causes some extents to not be scanned at all.
Change the TREE_SEARCH_V2 parameters so that Crawl sees all extents
newer than min_transid (i.e. set max_transid to max). The TREE_SEARCH_V2
kernel logic already operates this way, i.e. it fetches every page with
transid >= min_transid and discards newer items if they are too new for
max_transid. Filter strictly by the extent reference generation field
(i.e. the copy of the extent generation that is in the extent reference).
Note this still scans extent data multiple times, but it should now
be exactly once per extent reference. A proper fix for this requires
extent-based scanning instead of extent-ref-based scanning.
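A sketch of the parameter change plus the strict per-item filter (the
accessor follows btrfs-progs helper naming; the loop structure is
schematic):

    // Search: ask the kernel for everything from min_transid onward
    sk.min_transid = crawl.m_min_transid;
    sk.max_transid = std::numeric_limits<uint64_t>::max();

    // Filter: trust only the generation stored in the extent ref
    for (const auto &item : search_results) {
        uint64_t gen = btrfs_stack_file_extent_generation(item.fep);
        if (gen < crawl.m_min_transid) continue;  // already scanned
        if (gen > crawl.m_max_transid) continue;  // scan next cycle
        scan_extent_ref(item);
    }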
Formerly commit 5a8c655fc447c08772f01107a87e3364f093bb46 "roots: filter
out obsolete extents from extent refs" which landed in the subvol-threads
branch but not master.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Scan the roots tree directly for roots other than 5 (the FS root), and
use btrfs_get_root_transid on root_fd for root 5. This avoids filling
up the root FD cache every time we want a new transid_max. Now the only
reason we open a subvol root FD is to open a file within the subvol.
transid_max may be the same as the FS root's transid, in which case
the search loop is not necessary. Place a counter (transid_max_miss)
to see if we ever need to look at root items. If this counter never goes
above zero, or does so very rarely, we can delete the search loop.
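A sketch of the new lookup order (btrfs_get_root_transid is named in
this commit; the root-tree iteration is schematic):

    uint64_t
    BeesRoots::transid_max()
    {
        // Usually sufficient: the FS root's own transid
        uint64_t rv = btrfs_get_root_transid(m_ctx->root_fd());
        // Rarely needed: check whether any root item is newer
        for (const auto &ri : scan_root_items()) {
            if (ri.transid > rv) {
                BEESCOUNT(transid_max_miss);  // counter from this commit
                rv = ri.transid;
            }
        }
        return rv;
    }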
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task should not block for extended periods of time.
Remove the RateEstimator::wait_for() in crawl_roots. When crawl_roots
runs out of data, let the last crawl_task end without rescheduling.
Schedule crawl_task again on transid polls if it was not already running.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESLOGNOTE was intended to combine BEESLOG and BEESNOTE, i.e. write a
log message and set the task status message from a single expression.
With the log levels we would now need several more variants
(BEESLOGNOTEDEBUG, BEESLOGNOTEERR...) or a parameter (BEESNOTELOG(DEBUG,
...)).
Or we give up on the idea. This combination was used only 3 times so far.
The log messages and the note message have different editorial styles.
Remove the three instances of BEESLOGNOTE, and make the BEESLOGNOTE
definition equivalent to BEESLOG at LOG_NOTICE level for consistency.
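The surviving definition is a one-liner (assuming BEESLOG now takes a
syslog level argument, per the log-level filtering change):

    #define BEESLOGNOTE(x) BEESLOG(LOG_NOTICE, x)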
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Reword log message for discovery of new toxic extents vs. lookup of
previously known toxic extents. Also add the block data (especially
filename) to the discovery message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
No public version of bees ever created old-style compressed hash table
entries. Remove the code that supports them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a third scan mode with alternative trade-offs.
Benefits: Good sequential read performance. Avoids race conditions
described in https://github.com/Zygo/bees/issues/27. Avoids diverting
scan resources into short-lived snapshots before their long-lived
origin subvols are fully scanned.
Drawbacks: Takes the longest time of the three implemented scan-modes
to free space in extents that are shared between snapshots. Uses the
maximum amount of temporary space.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Duplicated code between the different scan modes has slowly become
less and less trivial. Move the code to a method and make both
scan modes call it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Prealloc extent sizes were taken from the Extent object and did not
take the file size into account. If a file with a non-4K-aligned
size is preallocated, the resulting dedup fails with an exception
because the sizes of the two ranges of the BeesRangePair do not match.
Limit the size of the replacement hole extent to not extend past the
end of the file.
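A sketch of the clamp (names are illustrative):

    #include <algorithm>

    // Don't let the replacement hole extend past EOF, or the two
    // halves of the BeesRangePair will have different sizes
    off_t hole_begin = extent.begin();
    off_t hole_end   = std::min(extent.end(), file_size);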
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restarting scans for each transid is a bit aggressive. Scan every 10
transids for a polling rate close to the former BEES_COMMIT_INTERVAL.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
transid_max is now measured at a single point in the crawl_transid thread.
Move the Crawl deferred logic into BeesRoots so it restarts all crawls
when transid_max increases. Gets rid of some messy time arithmetic.
Change name of Crawl thread to "crawl_master" in both thread name and
log messages.
Replace "Next transid" with "Crawl started".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The periodic cache age check was not protected by a lock, so multiple
threads may decide to concurrently clear the cache. This led to
duplicate log messages.
Fix by moving the cache expiry trigger out of FdCache and into Roots,
which knows when transids change and can perform cache clears at exactly
the time they are most relevant, i.e. after something that was deleted
becomes permanently so.
This removes the last references to BEES_COMMIT_INTERVAL, so get rid
of its definition too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Now that the polling interval is up to 30 times faster,
next_transid seems too verbose again.
Make it clearer that the interval quoted in the "Deferring..."
message is the computed transaction polling interval.
Combine "Next transid" and "Restarted crawl" into a single message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the crawl polling interval more closely track the commit interval
on the btrfs filesystem. In the future this will provide opportunities
to do things like clear FD caches and stop crawls on deleted subvols,
but triggered by transaction commits instead of arbitrary time intervals.
Rename the "crawl" thread so it no longer has the same name as the "crawl"
task, and repurpose it for dedicated transid polling. Cancel the deletion
of crawl_thread and repurpose it to trigger new crawls and wake up the
main crawl Task when it runs out of data.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Fix discussion of nodatasum files, clarifying what we can and cannot do.
Get rid of some BEESNOTE and BEESTRACE calls which cannot be observed
(well, BEESNOTE can, but you have to be quick!).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Having too many "write a message to the log" primitives is confusing,
and having one that intermittently and silently discards output is even
_more_ confusing.
Replace all BEESINFO with appropriate BEESLOG*s. Usually DEBUG.
Except for one or two that occur too often. Just delete those.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The data field of BeesBlockData is only interesting to those who want
to debug the BeesBlockData implementation or other battle-tested parts
of bees. Users who want to do this can modify and rebuild the source
to enable the output.
To everyone else, the data field is a huge, ongoing infoleak through
the log.
Don't bother with an option; just output the length of the data field
and nothing else.
Fixes: https://github.com/Zygo/bees/issues/53
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we are now unconditionally rendering the print_fn as a static
string, there is no need for it to be a function. We also need it to
be brief and mostly constant.
Use a string instead. Put the string before the function in the Task
constructor arguments so that the title string appears as a heading in
code, since we are making a breaking API change already.
Drop TASK_MACRO as it is broken by this change, but there is no similar
usage of Task anywhere to make it worth fixing.
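After the change, call sites read with the title as a heading; a
sketch of the new argument order (the lambda body is illustrative):

    // Before: Task(exec_fn, print_fn);
    // After: the title string comes first
    Task crawl_task("crawl_master", [=]() {
        crawl_roots();
    });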
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Move pthread_setname_np to the same place we do pthread_getname_np.
Detect errors in pthread_getname_np, but don't throw an exception,
because we would call ourselves recursively from the exception handler
when it tries to log the exception.
Fix the order of set_name and the first BEESNOTE/BEESLOG call in threads,
closing small time intervals where logs have the wrong thread name,
and that wrong name becomes persistent for the thread.
Make the main thread's name "bees" because Linux kernel stack traces use
the pthread name of the main thread instead of the name of the process.
Anonymous threads get the process name (usually "bees"). We should not
have any such threads, but we do. This appears to occur mostly during
exception stack unwinding. GCC/pthread bug?
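A sketch of the set-then-verify sequence; the error path reports
without throwing, since the exception handler would log and re-enter
this code (the function name is illustrative):

    #include <pthread.h>
    #include <cstring>
    #include <iostream>

    void bees_set_thread_name(const char *name)
    {
        // Linux truncates thread names to 15 chars + NUL
        pthread_setname_np(pthread_self(), name);
        char buf[32] = { 0 };
        int rv = pthread_getname_np(pthread_self(), buf, sizeof(buf));
        if (rv) {
            // Report without throwing: a throw here would recurse
            // through the exception logger back into this function
            std::cerr << "pthread_getname_np: " << strerror(rv) << "\n";
        }
    }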
Fixes: https://github.com/Zygo/bees/issues/51
Signed-off-by: Zygo Blaxell <bees@furryterror.org>