Restart crawl_more (and update crawl roots and flush FD caches) every
time the transid changes, and only when the transid changes, but
not more often than a reasonable minimum poll interval.
Clean up the log message: use the proper thread name and remove
the wildly inaccurate estimate of when crawl will resume.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need to cache 65536 extent maps, especially if each one
can have almost 700K references.
Valgrind's massif tool points to the extent map cache as a very
large memory allocator, but test runs with memcg disagree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we have loadavg targeting enabled, there may be no worker threads
available to respond to new subvols, so we should not bother updating
the subvols list.
Put insert_new_crawl into a Task so it only executes when a worker
is available.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On large filesystems where the min_transid of all subvols gets stuck at 0,
bees may lose the ability to effectively track recent data. A secondary sort
by max_transid will allow scanning newer subvols that were created after bees
started running on the filesystem, but before bees completed the first scan
of all subvols.
On the other hand, the secondary sort does a reverse version of the
sequential scan mode, and the sequential scan mode is simply awful.
Disable it for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Also attempted to clarify the descriptions of the modes based on
feedback and questions from users over the years.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split each scan mode into two distinct phases:
1. A heavy discovery phase, where we search the entire filesystem
for something (new items in subvol trees in this case).
2. A light consuming phase, where we fetch extents to dedupe
from places that we found in the discovery phase.
Part 1 recomputes the subvol ordering every time there is a new transid.
For some scan modes this computation is quite expensive, far too costly
to pay for every extent, so we do it no more than once per transaction.
Part 2 is run every time a worker thread hits the crawl_more Task.
It simply pulls one extent from the first crawler off a sorted list,
removing the crawler from the list when the crawler runs out of data.
Part 1 creates a new structure and swaps it into place, while Part 2
continues to run using the previous strucuture. Neither of these
need to block the other, so they don't.
The separate class and base pointer also make it easer to add new scan
modes that are not based on subvol trees or that don't use BeesCrawl.
While we're here, fix up some method visibility in BeesRoots.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Set the constructor's default scan mode to an invalid mode, so if we
change the default, we don't have to update two places.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Crawl mode 3 'recent' prioritizes data from new updates to previously
scanned subvols over subvols that have not been completely scanned yet.
If no such new data exists, falls back to a variation of 'lockstep'
scan mode.
This enables us to keep up with new data as it arrives, a key weakness
of all the other scan modes, and worth violating our unwritten "no new
scan modes until we have extent-tree dedupe working" policy for.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Inode-oriented scan workers must do all of their work sequentially,
so it's counterproductive to spawn a Task to do a background dedupe.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When two Tasks attempt to lock the same extent, append the later Task
to the earlier Task's post-exec work queue. This will guarantee that
all Tasks which attempt to manipulate the same extent will execute
sequentially, and free up threads to process other extents.
Similarly, if two scanner threads operate on the same inode, any dedupe
they perform will lock out other scanner threads in btrfs. Avoid this
by serializing Task objects that reference the same file.
This does theoretically use an unbounded amount of memory, but in practice
a Task that encounters a contended extent or inode quickly stops spawning
new Tasks that might increase the queue size, and all Tasks that might
contend for the same lock(s) end up on a single FIFO queue.
Note that the scope of inode locks is intentionally global, i.e. when
an inode is locked, it locks every inode with the same number in every
subvol. This avoids significant lock contention and task queue growth
when the same inode with the same file extents appear in snapshots.
Fixes: https://github.com/Zygo/bees/issues/158
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split crawlers into two separate Tasks:
1. a Task which locates the next inode with a new data extent.
2. a Task which scans every new extent in that inode.
This simplifies some lock contention and execution ordering issues.
Files are read sequentially. Workers dynamically scale up or
down as needed, without creating thousands of deferred Task objects.
Workers obtain inode locks for different inodes in btrfs, so they
can work in parallel instead of waiting for each other.
This change in behavior comes with new names for the worker Tasks:
"crawl_master" is now "crawl_more", the singular Task which
creates inode-scanning Tasks.
"crawl_<subvol>" is now "crawl_<subvol>_<inode>".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This was done on the development branch three years ago, and
has been creating annoying merge conflicts ever since. Sync
up the branches so they have the same names for these.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Drop the cache since we no longer have to open a file every time we
check a subvol's status.
Also stop counting workaround events at the root level twice.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
btrfs-tree provides classes for low-level access to btrfs tree objects.
An item class is provided to decode polymorphic btrfs item fields.
Several tree classes provide forward and backward iteration over raw
object items at different tree levels.
A csum tree class provides convenient access to csums by bytenr,
supporting all current btrfs csum types.
Wrapper classes for inode and subvol items provide direct access to
btrfs metadata fields without clumsy stat() wrappers or ioctls.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This template turns a forward search primitive (e.g. lower_bound, FIEMAP,
TREE_SEARCH_V2) into a backward search primitive.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We are using ByteVectors from multiple threads in some cases. Mostly
these are the status and progress threads which read the ByteVector
object references embedded in BEESNOTE macros.
Since it's not clear what the data race implications are, protect
the shared_ptr in ByteVector with a mutex for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Kernels that needed the balance workaround frankly are too buggy
to run bees at all. The workaround also makes the locking stories
around logical_ino calls and process exit complicated, so get rid of
it completely.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
At some point BtrfsExtentWalker will be fully deprecated and removed from
bees. Might as well start with code that hasn't been built in 6 years.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Quite often bees exceeds its service timeout for termination because
it is waiting for a loop embedded in a Task to finish some long-running
btrfs operation. This can cause bees to be aborted by SIGKILL before
it can completely flush the hash table or save crawl state.
There are only two important things SIGTERM does when bees terminates:
1. Save crawl progress
2. Flush out the hash table
Everything else is automatically handled by the kernel when the process
is terminated by SIGKILL, so we don't have to bother doing it ourselves.
This can save considerable time at shutdown since we don't have to wait
for every thread to reach a point where it becomes idle, or force loops
to terminate by throwing exceptions, or check a condition every time we
access a pointer. Instead, we need do only the things in the list
above, and then call _exit() to clean up everything else.
Hash table and crawl state writeback can happen in their background
threads instead of the foreground one. Separate the "stop" method for
these classes into "stop_request" and "stop_wait" so that these writebacks
can run at the same time.
Deprecate and remove all references to the BeesHalt exception, and remove
several unnecessary checks for BeesContext::stop_requested.
Pause the task queue instead of cancelling it, which preserves the
crawl progress state and stops new Tasks from competing for iops and
CPU during writeback.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Dump the instantaneous load (last 5 seconds, extracted from load average)
and the computed target worker count (before rounding and truncation)
on the same status line as the task and worker thread count.
This should give better visibility into Task's thread count calculation
algorithm.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tasks are often running longer than 5 seconds (especially extents with
multiple references requiring copy operations), so the load tracking
algorithm needs to average several samples over a longer period of time
than 5 seconds. If the sample period is 60 seconds, we end up recomputing
the original load average from current_load, so skip the rounding error
and use the original load average value.
Arguably the real fix is to break up the more complex extent operations
over several downstream Task objects, but that's a more significant
design change.
Tweak the attack and decay rates so that threads are started a little
more slowly, but still stopped rapidly when load spikes up.
Remove the hysteresis to provide support for load average targets
below 1, or with fractional components, with a PWM-like effect.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
task1.append(task2) is supposed to run task2 after task1 is executed;
however, if task1 was just executed, and its last reference was owned by
a TaskConsumer, then task2 will be appended to a Task that will never
run again.
A similar problem arises in Exclusion, which can cause blocked tasks
to occasionally be dropped without executing them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This was resulting in an assertion failure later on if a queue was
being rescued from a deleted task with only one post-exec queue.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
pause(true) stops the TaskMaster from processing any more Tasks,
but does not destroy any queued Tasks.
pause(false) re-enables Task processing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In the event that someday Barrier allows users to force execution of
its pending tasks prior to the destruction of the BarrierState object,
we'll be ready to submit those Tasks for execution without waiting for
the BarrierState mutex lock.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Exclusion was generating a new Task every time a lock was contended.
That results in thousands of empty Task objects which contain a single
Task item.
Get rid of ExclusionState. Exclusion is now a simple weak_ptr to a Task.
If the weak_ptr is expired, the Exclusion is unlocked. If the weak_ptr
is not expired, it points to the Task which owns the Exclusion.
try_lock now appends the Task attempting to lock the Exclusion directly
to the owning Task, eliminating the need for Exclusion to have one.
This also removes the need to call insert_task separately, though
insert_task remains for other use cases.
With no ExclusionState there is no need for a string argument to
Exclusion's constructor, so get rid of that too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make one class Barrier which is copiable, so we don't have to
have users making shared Barrier all the time.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It seems that readahead() does not work on btrfs, or at least it has
no discernable effect. Enable the workaround instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In current kernels there is a bug which leads to an infinite loop in
add_all_parents(). The bug is triggered by one thread running dedupe
while another runs logical_ino.
Work around this by ensuring that bees process never runs dedupe and
logical_ino ioctls at the same time. Any number of either can run
at the same time, but not one of both.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
For performance or workaround reasons we sometimes have to avoid doing
two conflicting operations at the same time, but we can still run any
number of non-conflicting operations in parallel.
MultiLocker (suggestions for a better class name welcome) blocks the
calling thread until there are no threads attempting to run a conflicting
operation.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees_sync() was an exception-trapping wrapper around fsync() which is
not needed in any of the contexts from which it was called:
1. dedupe operations implicitly flush the src data, so there is
no need to call fsync() to do that twice.
2. crawl position is written to a temporary file and renamed
over the original, which always forces a flush when the original
exists. On the first write, where there is no original, a
crash would result in starting over with an empty or hole-filled
beescrawl file, which is the initial state of bees. There is also
a long history of kernel bugs triggered by fsync() in this case.
3. we use unreadahead to trigger writeback for flushing the
hash table to persistent storage. Here is a space where we might
use fsync after all, as part of bees_unreadahead's emulation of
POSIX_FADV_DONTNEED, but we need to get read-once behavior from
the scanner before we can use this capability.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If there's an error while writing the crawl state, the state should
remain dirty. If the crawl state is successfully written, the state
is only clean if there were no changes to crawl state since the write
was committed. We need to release the lock while writing the state but
correctly set the dirty flag when the state is written successfully.
Replace the bool with a version number counter. Track the last version
successfully saved and the current version of the crawl state. The state
is dirty if these counters disagree and clean if they agree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Document the overall purpose of the class and what some of the methods do,
particularly the ones with terrible names like 'insert_item' (which only
inserts an item after calling the Function).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>