set() was broken and redundant. Calling hold() and discarding the
returned object has the correct effect.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The task queue can become very large with many subvols, requiring hours
for the queue to clear. 'beescrawl.dat' saves in the meantime will save
the work currently scheduled, not the work currently completed.
Fix by tracking progress with ProgressTracker. ProgressTracker::begin()
gives the last completed crawl position. ProgressTracker::end() gives
the last scheduled crawl position. begin() does not advance if there
is any item between begin() and end() is not yet completed. In between
are crawled extents that are on the task queue but not yet processed.
The file 'beescrawl.dat' saves the begin() position while the extent
scanning task queue is fed from the end() position.
Also remove an unused method crawl_state_get() and repurpose the
operator<(BeesCrawlState) that nobody was using.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
One very common case is losing a race to open a file that was deleted.
No need to spam the logs with mere ENOENT reports.
Other errors are more significant. Log those with errno, and
add event counters to record them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The previous commit had both max_transid assigments commented out.
It happens to work because we set max_transid in the constructor and
it doesn't change after that, but it's cleaner to assign it explicitly.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When an extent ref is modified, all of the refs in the same metadata
page get the same transid in the TREE_SEARCH_V2 header. This causes
two problems:
- Extents with generation < min_transid are included if they
happen to be referenced by pages with generation >= min_transid.
- Extent refs with generation > max_transid are excluded even
if they reference extents with generation <= max_transid.
Both of these are wrong: the first causes some extents to be repeatedly
scanned, the second causes some extents to not be scanned at all.
Change the TREE_SEARCH_V2 parameters so that Crawl sees all extents
newer than min_transid (i.e. set max_transid to max). The TREE_SEARCH_V2
kernel logic already operates this way, i.e. it fetches every page with
transid >= min_transid and discards newer items if they are too new for
max_transid. Filter strictly by the extent reference generation field
(i.e. the copy of the extent generation that is in the extent reference).
Note this still scans extent data multiple times, but it should now
be exactly once per extent reference. A proper fix for this requires
extent-based scanning instead of extent-ref-based scanning.
Formerly commit 5a8c655fc447c08772f01107a87e3364f093bb46 "roots: filter
out obsolete extents from extent refs" which landed in the subvol-threads
branch but not master.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Scan the roots tree directly for roots other than 5 (the FS root), and
use btrfs_get_root_transid on root_fd for root 5. This avoids filling
up the root FD cache every time we want a new transid_max. Now the only
reason we open a subvol root FD is to open a file within the subvol.
transid_max may be the same as the FS root's transid, in which case
the search loop is not necessary. Place a counter (transid_max_miss)
to see if we ever need to look at root items. If this counter never goes
above zero, or does so very rarely, we can delete the search loop.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task should not block for extended periods of time.
Remove the RateEstimator::wait_for() in crawl_roots. When crawl_roots
runs out of data, let the last crawl_task end without rescheduling.
Schedule crawl_task again on transid polls if it was not already running.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a third scan mode with alternative trade-offs.
Benefits: Good sequential read performance. Avoids race conditions
described in https://github.com/Zygo/bees/issues/27. Avoids diverting
scan resources into short-lived snapshots before their long-lived
origin subvols are fully scanned.
Drawbacks: Takes the longest time of the three implemented scan-modes
to free space in extents that are shared between snapshots. Uses the
maximum amount of temporary space.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Duplicated code between the different scan modes has slowly been
becoming less and less trivial. Move the code to a method and
make both scan-modes call it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restartng scans for each transid is a bit aggressive. Scan every 10
transids for a polling rate close to the former BEES_COMMIT_INTERVAL.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
transid_max is now measured at a single point in the crawl_transid thread.
Move the Crawl deferred logic into BeesRoots so it restarts all crawls
when transid_max increases. Gets rid of some messy time arithmetic.
Change name of Crawl thread to "crawl_master" in both thread name and
log messages.
Replace "Next transid" with "Crawl started".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The periodic cache age check was not protected by a lock, so multiple
threads may decide to concurrently clear the cache. This led to
duplicate log messages.
Fix by moving the cache expiry trigger out of FdCache and into Roots,
which knows when transids change and can perform cache clears at exactly
the time they are most relevant, i.e. after something that was deleted
becomes permanently so.
This removes the last references to BEES_COMMIT_INTERVAL, so get rid
of its definition too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Now that the polling interval is up to 30 times faster,
next_transid seems too verbose again.
Make it clearer that the interval quoted in the "Deferring..."
message is the computed transaction polling interval.
Combine "Next transid" and "Restarted crawl" into a single message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the crawl polling interval more closely track the commit interval
on the btrfs filesystem. In the future this will provide opportunities
to do things like clear FD caches and stop crawls on deleted subvols,
but triggered by transaction commits instead of arbitrary time intervals.
Rename the "crawl" thread so it no longer has the same name as the "crawl"
task, and repurpose it for dedicated transid polling. Cancel the deletion
of crawl_thread and repurpose it to trigger new crawls and wake up the
main crawl Task when it runs out of data.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Fix discussion of nodatasum files, clarifying what we can and cannot do.
Get rid of some BEESNOTE and BEESTRACE calls which cannot be observed
(well, BEESNOTE can, but you have to be quick!).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Having too many "write a message to the log" primitives is confusing,
and having one that intermittently and silently discards output is even
_more_ confusing.
Replace all BEESINFO with appropriate BEESLOG*s. Usually DEBUG.
Except for one or two that occur too often. Just delete those.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we are now unconditionally rendering the print_fn as a static
string, there is no need for it to be a function. We also need it to
be brief and mostly constant.
Use a string instead. Put the string before the function in the Task
constructor arguments so that the title string appears as a heading in
code, since we are making a breaking API change already.
Drop TASK_MACRO as it is broken by this change, but there is no similar
usage of Task anywhere to make it worth fixing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Silence the three(!) log messages per crawl increment an extra one at
the end of the subvol.
The three critical messages per subvol crawl cycle are:
Next transid in BeesCrawlState <SUBVOL>:0 offset 0x0 transid <A>..<B> started <T> (<AGO>s ago)
Subvol has been completely scanned and a new transaction range will
be created. CrawlState is the state of the old subvol.
Restarted crawl BeesCrawlState <SUBVOL>:0 offset 0x0 transid <B>..<C> started <T+AGO> (0s ago)
Subvol has been restarted. CRawlState is the state of the new subvol.
Deferring next transid in BeesCrawlState <SUBVOL>:0 offset 0x0 transid <B>..<C> started <T+AGO> (0s ago)
Subvol has been completely scanned, but it is too soon to start a
new scan.
Fix the "Restart..." message to use the correct verb tense and to use
the correct BeesCrawlState data.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit adds log levels to the output. In systemd, it makes colored
lines, otherwise it's probably just a number. Bees is very chatty, so
this paves the road for log level filtering.
Signed-off-by: Kai Krakow <kai@kaishome.de>
There are two subvol scan algorithms implemented so far. The two modes
are unimaginatively named 0 and 1.
0: sorts extents by (inode, subvol, offset),
1: scans extents round-robin from all subvols.
Algorithm 0 scans references to the same extent at close to the same
time, which is good for performance; however, whenever a snapshot is
created, the scan of the entire filesystem restarts at the beginning of
the new snapshot.
Algorithm 1 makes continuous forward progress even when new snapshots
are created, but it does not benefit from caching and will force the
kernel to reread data multiple times when there are snapshots.
The algorithm can be selected at run-time using the -m or --scan-mode
option.
We can collect some field data on these before replacing them with
an extent-tree-based scanner. Alternatively, for pre-4.14 kernels,
we can keep these two modes as non-default options.
Currently these algorithms have terrible names. TODO: fix that, but
also TODO: delete all that code and do scans directly from the extent
tree instead.
Augment the scan algorithms relative to their earlier implementation by
batching multiple extents to scan from each subvol before switching to
a different subvol.
Sprinkle some BEESNOTEs on the Task objects so that they don't
disappear from the thread status output.
Adjust some timing constants to deal with the increased latency from
competing threads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Distribute incoming extents across a thread pool for faster execution
on multi-core, multi-disk environments.
Switch extent enumeration model to scan extent refs consecutively(ish).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In both instances the code contained within (or the conditional
compilation surrounding it) is no longer controversial.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESNOTE puts a message on the status message stack. BEESINFO logs a
message with rate limiting. The message that was flooding the logs
was coming from BEESINFO not BEESNOTE.
Fix earlier commit which removed the wrong message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
After a few hundred subvol threads start running, the inode cache starts
to thrash, and the log gets spammed with messages of the form:
"open_root_nocache <subvolid>: <path>"
Ideally there would be some way to schedule work to minimize inode
thrashing. Until that gets done, just silence the messages for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we lose a race and open the wrong file, we will not retry with the
next path if the file we opened had incompatible flags. We need to keep
trying paths until we open the correct file or run out of paths.
Fix by moving the inode flag check after the checks for file identity.
Output attributes in hex to be consistent with other attribute error
messages.
There is no need to report root and file paths separately in the error
message for incompatible flags because we have confirmed the identity of
the file before the incompatible flag error is detected. Other messages
in this loop still output root path and file_path separately because
the identity of 'rv' is unknown at the time these messages are emitted.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If you have a lot of or a few big nocow files (like vm images) which
contain a lot of potential deduplication candidates, bees becomes
incredibly slow running through a lot "invalid operation" exceptions.
Let's just skip over such files to get more bang for the buck. I did no
regression testing as this patch seems trivial (and I cannot imagine any
pitfalls either). The process progresses much faster for me now.
In gcc 7+ warning: implicit-fallthrough has been added
In some places fallthrough is expectable, disable warning
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Before:
unique_lock<mutex> lock(some_mutex);
// run lock.~unique_lock() because return
// return reference to unprotected heap
return foo[bar];
After:
unique_lock<mutex> lock(some_mutex);
// make copy of object on heap protected by mutex lock
auto tmp_copy = foo[bar];
// run lock.~unique_lock() because return
// pass locally allocated object to copy constructor
return tmp_copy;
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Previously, the scan order processed each subvol in order. This required
very large amounts of temporary disk space, as a full filesystem scan
was required before any shared extents could be deduped. If the hash
table RAM was underprovisioned this would mean some shared dup blocks
were removed from the hash table before they could be deduped.
Currently the scan order takes the first unscanned extent from each
subvol. This works well if--and only if--the subvols are either empty
or children of a common ancestor. It forces the same inode/offset pairs
to be read at close to the same time from each subvol.
When a new snapshot is created, this ordering diverts scanning to the
new subvol until it catches up to the existing subvols. For large
filesystems with frequent snapshot creation this means that the scanner
never reaches the end of all subvols. Each new subvol effectively
resets the current scan position for the entire filesystem to zero.
This prevents bees from ever completing the first filesystem scan.
Change the order again, so that we now read one unscanned extent from
each subvol in round-robin fashion. When a new subvol is created, we
share scan time between old and new subvols. This ensures we eventually
finish scanning initial subvols and enter the incremental scanning state.
The cost of this change is more repeated reading of shared extents at
scan time with less benefit from disk-device-level caching; however, the
only way to really fix this problem is to implement scanning on tree 2
(the btrfs extent tree) instead of the subvol trees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The thread name has an arbitrarily limited size, and we are eventually
removing support for multiple paths in a single bees daemon process.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We really do need some large buffers for BtrfsIoctlSearchKey in some
cases, but we don't need to zero them out first. Don't do that so we
save some CPU.
Reduce the default buffer size to 4K because most BISK users don't get
need much more than 1K. Set the buffer size explicitly to the product of
the number of items and the desired item size in the places that really
need a lot of items.
Linux kernel commit 7f8e406 ("btrfs: improve delayed refs iterations")
seems to dramatically improve LOGICAL_INO performance. Hopefully this
commit will find its way into mainline Linux soon.
This means that most of the time in Bees is now spent on block reading
(50-75%); however, there is still a big gap between block read and
the sum of everything else we are measuring with the "*_ms" counters.
This gap is about 30% of the run time, so it would be good to find out
what's in the gap.
Add ms counters around the crawl and open calls to capture where we are
spending all the time.