Dedupe is not possible on a subvol where a btrfs send is running:
BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress)
btrfs informs a process with EAGAIN that a dedupe could not be performed
due to a running send operation.
It would be possible to save the crawler state at the affected point,
fork a new crawler that avoids the subvol under send, and resume the
crawler state after a successful dedupe is detected; however, this only
helps the intersection of the set of users who have unrelated subvols
that don't share extents, and the set of users who cannot simply delay
dedupe until send is finished. The simplest approach is to simply stop
and wait until the send goes away.
The simplest approach is taken here. When a dedupe fails with EAGAIN,
affected Tasks will poll, approximately once per transaction, until the
dedupe succeeds or fails with a different error.
bees dedupe performance corresponds with the availability of subvols that
can accept dedupe requests. While the dedupe is paused, no new Tasks can
be performed by the worker thread. If subvols are small and isolated
from the bulk of the filesystem data, the result will be a small but
partial loss of dedupe performance during the send as some worker threads
get stuck on the sending subvol. If subvols heavily share extents with
duplicate data in other subvols, worker threads will all become blocked,
and the entire bees process will pause until at least some of the running
sends terminate.
During the polling for btrfs send, the dedupe Task will hold its dst
file open. This open FD won't interfere with snapshot or file delete
because send subvols are always read-only (it is not possible to delete
a file on a RO subvol, open or otherwise) and send itself holds the
affected subvol open, preventing its deletion. Once the send terminates,
the dedupe will terminate soon after, and the normal FD release can occur.
This pausing during btrfs send is unrelated to the
`--workaround-btrfs-send` option, although `--workaround-btrfs-send` will
cause the pausing to trigger less often. It applies to all scan modes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are no callers of this method any more, and it exposes more
of BeesRoots than we really want things to have access to.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extents are mostly gone in kernel 5.7 and later. Increase the
timeout for toxic extent handling to reduce false positives, and remove
persistenly stored toxic hashes from the hash table.
Toxic hashes are still stored nonpersistently to help mitigate problems
due to any remaining kernel bugs.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the progress information more accessible, without having to
enable full debug log and fish it out of the stream with grep.
Also increase the progress log level to INFO.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Replace pointers in the "done" and "total" columns with estimated data
sizes for each size tier. The estimation is based on statistics
collected from extents scanned during the current bees run.
Move the total size for the entire filesystem up to the heading.
Report the _completed_ position (i.e. the one that would be saved in
`beescrawl.dat`), not the _queued_ position (i.e. the one where the
next Task would be created in memory).
At the end of the data, the crawl pointer ends up at some random point
in the filesystem just after the newest extent, so the progress gets to
99.7% and then goes to some random value like 47% or 3%, not to 100%.
Report "deferred" in the "done" column when the crawler is waiting for
the next transid, and "finished" in the "%done" column when the crawler
has reached the end of the data. Suppress the ETA when finished. This
makes it clear that there's no further work to do for these crawlers.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesScanModeExtent uses six scan Tasks instead of one, which leads
to awkwardness like the do_scan method to tell crawl_roots how to do
what it shouldn't need to know how to do anyway.
Move the crawl_roots logic into the ::scan methods themselves.
This also deletes the very popular "crawl_more ran out of data" message.
Extent scan explicitly indicates when a scan is complete, so there's
no longer a need to fish this message out of the log.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The EXTENT scan mode reads the extent tree, splits it into tiers by
extent size, converts each tiers's extents into subvol/inode/offset refs,
then runs the legacy bees dedupe engine on the refs.
The extent scan mode can cheaply compute completion percentage and ETA,
so do that every time a new transid is observed.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A laundry list of problems fixed:
* Track which physical blocks have been read recently without making
any changes, and don't read them again.
* Separate dedupe, split, and hole-punching operations into distinct
planning and execution phases.
* Keep the longest dedupe from overlapping dedupe matches, and flatten
them into non-overlapping operations.
* Don't scan extents that have blocks already in the hash table.
We can't (yet) touch such an extent without making unreachable space.
Let them go.
* Give better information in the scan summary visualization: show dedupe
range start and end points (<ddd>), matching blocks (=), copy blocks
(+), zero blocks (0), inserted blocks (.), unresolved match blocks
(M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i),
and there's-a-bug-we-didn't-do-this-one blocks (#).
* Drop cached data from extents that have been inserted into the hash
table without modification.
* Rewrite the hole punching for uncompressed extents, which apparently
hasn't worked properly since the beginning.
Nuisance dedupe elimination:
* Don't do more than 100 dedupe, copy, or hole-punch operations per
extent ref.
* Don't split an extent or punch a hole unless dedupe would save at
least half of the extent ref's size.
* Write a "skip:" summary showing the planned work when nuisance
dedupe elimination decides to skip an extent.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Originally the limit was 2730 (64KiB worth of ref pointers). This limit
was a little too low for some common workloads, so it was then raised by
a factor of 256 to 699050, but there are a lot of problems with extent
counts that large. Most of those problems are memory usage and speed
problems, but some of them trigger subtle kernel MM issues.
699050 references is too many to be practical. Set the limit to 9999,
only 3-4x larger than the original 2730, to give up on deduplication
when each deduped ref reduces the amount of space by no more than 0.01%.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves some of the worst problems with bees reads:
1. The kernel readahead doesn't work. More precisely, it's much better
adapted for a very different use case: a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead. The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads get
preempted by the random reads, or cancels the readaheads because the
data access pattern isn't sequential after the readahead was issued.
2. Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces. At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile. In all other cases,
the elaborate calculations always return the same result: there can be
only one reader at a time.
This commit fixes both problems:
1. Don't use the kernel readahead. Use normal reads into a dummy
buffer instead.
2. Allow only one thread to readahead at any time. Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These were added to crucible all the way back in 2018 (1beb61fb78ba
"crucible: error: record location of exception in what() message")
but it's even more useful in the stack tracer in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Each object contains a 16 MiB buffer, which is very heavy for some
malloc implementations.
Keep the objects in a Pool so that their buffers are only allocated and
deallocated once in the process lifetime.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If the send workaround is enabled, it is possible for two threads (a
thread running the crawl_new task, and a thread attempting to apply the
send workaround) to access the same RootFetcher object at the same time.
That never ends well.
Give each function its own BtrfsRootFetcher object.
Fixes: https://github.com/Zygo/bees/issues/250
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With SIGTERM and fast exit, the trickle writeback is less important.
We don't want to flood people's IO subsystems with continuous writes.
This really should be configurable at runtime.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When a hash table write fails, we skip over the write throttling because
we didn't report that we successfully wrote an extent. This can be bad
if the filesystem is full and the allocations for writes are burning a
lot of CPU time searching for free space.
We also don't retry the write later on since we assume the extent is
clean after a write attempt whether it was successful or not, so the
extent might not be written out later when writes are possible again.
Check whether a hash extent is dirty, and always throttle after
attempting the write.
If a write fails, leave the extent dirty so we attempt to write it out
the next time flush cycles through the hash table. During shutdown
this will reattempt each failing write once, after that the updated hash
table data will be dropped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The caller of scan_forward has to stop advancing the BeesFileCrawl
position when an extent lock blocks a scan, so that it will resume
from the same position when the Task is scheduled again; otherwise,
bees simply skips over the extent and leave it incompletely deduped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restart crawl_more (and update crawl roots and flush FD caches) every
time the transid changes, and only when the transid changes, but
not more often than a reasonable minimum poll interval.
Clean up the log message: use the proper thread name and remove
the wildly inaccurate estimate of when crawl will resume.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need to cache 65536 extent maps, especially if each one
can have almost 700K references.
Valgrind's massif tool points to the extent map cache as a very
large memory allocator, but test runs with memcg disagree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we have loadavg targeting enabled, there may be no worker threads
available to respond to new subvols, so we should not bother updating
the subvols list.
Put insert_new_crawl into a Task so it only executes when a worker
is available.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split each scan mode into two distinct phases:
1. A heavy discovery phase, where we search the entire filesystem
for something (new items in subvol trees in this case).
2. A light consuming phase, where we fetch extents to dedupe
from places that we found in the discovery phase.
Part 1 recomputes the subvol ordering every time there is a new transid.
For some scan modes this computation is quite expensive, far too costly
to pay for every extent, so we do it no more than once per transaction.
Part 2 is run every time a worker thread hits the crawl_more Task.
It simply pulls one extent from the first crawler off a sorted list,
removing the crawler from the list when the crawler runs out of data.
Part 1 creates a new structure and swaps it into place, while Part 2
continues to run using the previous strucuture. Neither of these
need to block the other, so they don't.
The separate class and base pointer also make it easer to add new scan
modes that are not based on subvol trees or that don't use BeesCrawl.
While we're here, fix up some method visibility in BeesRoots.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Set the constructor's default scan mode to an invalid mode, so if we
change the default, we don't have to update two places.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Crawl mode 3 'recent' prioritizes data from new updates to previously
scanned subvols over subvols that have not been completely scanned yet.
If no such new data exists, falls back to a variation of 'lockstep'
scan mode.
This enables us to keep up with new data as it arrives, a key weakness
of all the other scan modes, and worth violating our unwritten "no new
scan modes until we have extent-tree dedupe working" policy for.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When two Tasks attempt to lock the same extent, append the later Task
to the earlier Task's post-exec work queue. This will guarantee that
all Tasks which attempt to manipulate the same extent will execute
sequentially, and free up threads to process other extents.
Similarly, if two scanner threads operate on the same inode, any dedupe
they perform will lock out other scanner threads in btrfs. Avoid this
by serializing Task objects that reference the same file.
This does theoretically use an unbounded amount of memory, but in practice
a Task that encounters a contended extent or inode quickly stops spawning
new Tasks that might increase the queue size, and all Tasks that might
contend for the same lock(s) end up on a single FIFO queue.
Note that the scope of inode locks is intentionally global, i.e. when
an inode is locked, it locks every inode with the same number in every
subvol. This avoids significant lock contention and task queue growth
when the same inode with the same file extents appear in snapshots.
Fixes: https://github.com/Zygo/bees/issues/158
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split crawlers into two separate Tasks:
1. a Task which locates the next inode with a new data extent.
2. a Task which scans every new extent in that inode.
This simplifies some lock contention and execution ordering issues.
Files are read sequentially. Workers dynamically scale up or
down as needed, without creating thousands of deferred Task objects.
Workers obtain inode locks for different inodes in btrfs, so they
can work in parallel instead of waiting for each other.
This change in behavior comes with new names for the worker Tasks:
"crawl_master" is now "crawl_more", the singular Task which
creates inode-scanning Tasks.
"crawl_<subvol>" is now "crawl_<subvol>_<inode>".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This was done on the development branch three years ago, and
has been creating annoying merge conflicts ever since. Sync
up the branches so they have the same names for these.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Drop the cache since we no longer have to open a file every time we
check a subvol's status.
Also stop counting workaround events at the root level twice.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Kernels that needed the balance workaround frankly are too buggy
to run bees at all. The workaround also makes the locking stories
around logical_ino calls and process exit complicated, so get rid of
it completely.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Quite often bees exceeds its service timeout for termination because
it is waiting for a loop embedded in a Task to finish some long-running
btrfs operation. This can cause bees to be aborted by SIGKILL before
it can completely flush the hash table or save crawl state.
There are only two important things SIGTERM does when bees terminates:
1. Save crawl progress
2. Flush out the hash table
Everything else is automatically handled by the kernel when the process
is terminated by SIGKILL, so we don't have to bother doing it ourselves.
This can save considerable time at shutdown since we don't have to wait
for every thread to reach a point where it becomes idle, or force loops
to terminate by throwing exceptions, or check a condition every time we
access a pointer. Instead, we need do only the things in the list
above, and then call _exit() to clean up everything else.
Hash table and crawl state writeback can happen in their background
threads instead of the foreground one. Separate the "stop" method for
these classes into "stop_request" and "stop_wait" so that these writebacks
can run at the same time.
Deprecate and remove all references to the BeesHalt exception, and remove
several unnecessary checks for BeesContext::stop_requested.
Pause the task queue instead of cancelling it, which preserves the
crawl progress state and stops new Tasks from competing for iops and
CPU during writeback.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In current kernels there is a bug which leads to an infinite loop in
add_all_parents(). The bug is triggered by one thread running dedupe
while another runs logical_ino.
Work around this by ensuring that bees process never runs dedupe and
logical_ino ioctls at the same time. Any number of either can run
at the same time, but not one of both.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees_sync() was an exception-trapping wrapper around fsync() which is
not needed in any of the contexts from which it was called:
1. dedupe operations implicitly flush the src data, so there is
no need to call fsync() to do that twice.
2. crawl position is written to a temporary file and renamed
over the original, which always forces a flush when the original
exists. On the first write, where there is no original, a
crash would result in starting over with an empty or hole-filled
beescrawl file, which is the initial state of bees. There is also
a long history of kernel bugs triggered by fsync() in this case.
3. we use unreadahead to trigger writeback for flushing the
hash table to persistent storage. Here is a space where we might
use fsync after all, as part of bees_unreadahead's emulation of
POSIX_FADV_DONTNEED, but we need to get read-once behavior from
the scanner before we can use this capability.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If there's an error while writing the crawl state, the state should
remain dirty. If the crawl state is successfully written, the state
is only clean if there were no changes to crawl state since the write
was committed. We need to release the lock while writing the state but
correctly set the dirty flag when the state is written successfully.
Replace the bool with a version number counter. Track the last version
successfully saved and the current version of the crawl state. The state
is dirty if these counters disagree and clean if they agree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We had an unfortunate pattern of:
const BeesFileRange bfr;
shared_ptr<BeesContext> ctx;
// ...
BEESNOTE("foo " << bfr);
bfr.fd(ctx);
BEESNOTE("foo after opening: " << bfr);
If dump_status started running after the first BEESNOTE, but before
the second, then bfr.fd() might expose a single Fd object's shared_ptr
member to two threads at the same time (the thread running dump_status
and the thread running BEESNOTE) without protection by a lock. One of
the threads would see a partially-initialized Fd object, and the other
thread would crash on an assertion failure, e.g.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f4c4fde5537 in __GI_abort () at abort.c:79
#2 0x00007f4c4fde540f in __assert_fail_base (fmt=0x7f4c4ff4e128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5557605629dd "!m_destroyed", file=0x5557605627c0 "../include/crucible/namedptr.h", line=77, function=<optimized out>) at assert.c:92
#3 0x00007f4c4fdf4662 in __GI___assert_fail (assertion=assertion@entry=0x5557605629dd "!m_destroyed", file=file@entry=0x5557605627c0 "../include/crucible/namedptr.h", line=line@entry=77,
function=function@entry=0x555760562970 "crucible::NamedPtr<Return, Arguments>::Value::~Value() [with Return = crucible::IOHandle; Arguments = {int}]") at assert.c:101
#4 0x00005557605306f6 in crucible::NamedPtr<crucible::IOHandle, int>::Value::~Value (this=0x7f4a3c2ff0d0, __in_chrg=<optimized out>) at ../include/crucible/namedptr.h:77
#5 0x00005557605137da in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151
#6 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151
#7 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f4c4c5b5f28, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#8 std::__shared_ptr<crucible::IOHandle, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#9 std::shared_ptr<crucible::IOHandle>::~shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
#10 crucible::Fd::~Fd (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at ../include/crucible/fd.h:46
#11 BeesFileRange::file_size (this=0x7f4c4e5ba4a0) at bees-types.cc:156
#12 0x0000555760513950 in operator<< (os=..., bfr=...) at bees-types.cc:80
#13 0x000055576050d662 in std::function<void (std::ostream&)>::operator()(std::ostream&) const (__args#0=..., this=0x7f4c4e5b9f60) at /usr/include/c++/10/bits/std_function.h:622
#14 BeesNote::get_status[abi:cxx11]() () at bees-trace.cc:165
#15 0x00005557604c9676 in BeesContext::dump_status (this=0x5557611c4de0) at bees-context.cc:89
#16 0x00005557605206fb in std::function<void ()>::operator()() const (this=this@entry=0x7f4c4c5b65f0) at /usr/include/c++/10/bits/std_function.h:622
#17 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) (f=..., explainer=...) at error.cc:55
#18 0x000055576050aaa7 in operator() (__closure=0x5557611c52c8) at bees-thread.cc:22
#19 0x00007f4c501beed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007f4c502c8ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007f4c4febddef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Fix by making BeesFileRange::m_fd really const (not just mutable),
then fix all the broken code referencing it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It turns out we never set m_dirty's initial value. This is not a
practical problem because 1) it's mostly harmless if m_dirty is spuriously
true, 2) we set it to true every time bees scans a data block, and 3)
the allocation happens early in startup when most memory allocations
are using zero-filled pages, so it's probably getting a false value at
construction in most cases.
valgrind complains about it, so it has to go.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are kernel bugs in LOGICAL_INO from time to time; however, we
can't avoid these bugs by serializing LOGICAL_INO calls.
It hasn't been used for some time, so remove the code and
less-than-completely-accurate comments.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We need random numbers in more places, so centralize the engines.
Initialize with a proper random seed so every worker thread gets
different behavior.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The vector<uint8_t> in the hash table doesn't hurt very much--only a few
microseconds per 128K hash block.
The vector<uint8_t> in BeesBlockData hurts a bit more--we run that
constructor thousands of times per second.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table is one of the few cases in bees where a non-trivial amount
of page cache memory will be used in a predictable way, so we can advise
the kernel about our IO demands in advance.
Use WILLNEED to prefetch hash table pages at startup.
Use DONTNEED to trigger writeback on hash table pages at shutdown.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
LOGICAL_INO_V2 has a maximum limit of 655050 references per extent.
Although it no longer has a crippling performance problem, at roughly
two seconds to process extent, it's too slow to be useful.
When an extent gains an absurd number of references, stop making any
more. Returning zero extent refs will make bees believe the extent
was deleted, and it will remove the block from the hash table.
This helps speed processing of highly duplicated large files like
VM images, and the cost of a slightly lower dedupe hit rate.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>