mirror of https://github.com/Zygo/bees.git synced 2025-08-01 13:23:28 +02:00

158 Commits

Author SHA1 Message Date
Zygo Blaxell
e9d4aa4586 roots: make the "idle" label useful
Apply the "idle" label only when the crawl is finished _and_ its
transid_max is up to date.  This makes the keyword "idle" better reflect
when bees has not only finished crawling, but has also finished scanning
the crawled extents in the queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 23:06:14 -04:00
Zygo Blaxell
504f4cda80 progress: move the "idle" cell to the next cycle ETA column
When all extents within a size tier have been queued, and all the
extents belong to the same file, the queue might take a long time to
fully process.  Also, any progress that is made will be obscured by
the "idle" tag in the "point" column.

Move "idle" to the next cycle ETA column, since the ETA duration will
be zero, and no useful information is lost since we would have "-"
there anyway.

Since the "point" column can now display the maximum value, lower
that maximum to 999999 so that we don't use an extra column.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 22:33:05 -04:00
Zygo Blaxell
6c36f4973f extent scan: log the bfr when removing a prealloc extent
With subvol scan, the crawl task name is the subvol/inode pair
corresponding to the file offset in the log message.  The identity of
the file can be determined by looking up that subvol/inode pair.

With extent scan, the crawl task name is the extent bytenr corresponding
to the file offset in the log message.  This extent is deleted when the
log message is emitted, so a later lookup on the extent bytenr will not
find any references to the extent, and the identity of the file cannot
be determined.

Log the bfr, which does a /proc lookup on the name of the fd, so the
filename is logged.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 22:33:05 -04:00
Zygo Blaxell
b1bd99c077 seeker: harden against changes in the data during binary search
During the search, the region between `upper_bound` and `target_pos`
should contain no data items.  The search lowers `upper_bound` and raises
`lower_bound` until they both point to the last item before `target_pos`.

The `lower_bound` is increased to the position of the last item returned
by a search (`high_pos`) when that item is lower than `target_pos`.
This avoids some loop iterations compared to a strict binary search
algorithm, which would increase `lower_bound` only as far as `probe_pos`.

When the search runs over live extent items, occasionally a new extent
will appear between `upper_bound` and `target_pos`.  When this happens,
`lower_bound` is bumped up to the position of one of the new items, but
that position is in the "unoccupied" space between `upper_bound` and
`target_pos`, where no items are supposed to exist, so `seek_backward`
throws an exception.

To cut down on the noise, only increase `lower_bound` as far as
`upper_bound`.  This avoids the exception without increasing the number
of loop iterations for normal cases.

In the exceptional cases, extra loop iterations are needed to skip over
the new items.  This raises the worst-case number of loop iterations
by one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
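The clamping described in this commit can be illustrated with a simplified, self-contained bisection in the style of `seek_backward` (this is a sketch over a `std::set`, not bees' actual `crucible::seeker` code; the clamp via `std::min` is the hardening in question):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

// Find the last item at or before target_pos.  fetch(pos) models the tree
// search: it returns the first item at or after pos, or UINT64_MAX if none.
uint64_t seek_backward(uint64_t target_pos, const std::set<uint64_t> &items)
{
	auto fetch = [&](uint64_t pos) -> uint64_t {
		auto it = items.lower_bound(pos);
		return it == items.end() ? UINT64_MAX : *it;
	};
	uint64_t lower_bound = 0, upper_bound = target_pos;
	while (lower_bound < upper_bound) {
		const uint64_t probe_pos =
			lower_bound + (upper_bound - lower_bound + 1) / 2;
		const uint64_t found = fetch(probe_pos);
		if (found <= target_pos) {
			// Raise lower_bound to the found item, but only as far
			// as upper_bound: an item inserted concurrently in the
			// "unoccupied" region above upper_bound must not break
			// the invariant lower_bound <= upper_bound.
			lower_bound = std::min(found, upper_bound);
		} else {
			// No items in [probe_pos, target_pos]: shrink from above.
			upper_bound = probe_pos - 1;
		}
	}
	return fetch(lower_bound) == lower_bound ? lower_bound : UINT64_MAX;
}
```

With a static item set the clamp never fires; it matters only when items appear mid-search, at the cost of at most one extra loop iteration.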
Zygo Blaxell
d5e805ab8d seeker: add a real-world test case
This seek_backward failed in bees because an extent appeared during
the search:

	fetch(probe_pos = 6821971036, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821971004 = probe_pos - have_delta 32 (want_delta 32)
	fetch(probe_pos = 6821971004, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970972 = probe_pos - have_delta 32 (want_delta 32)
	fetch(probe_pos = 6821970972, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970908 = probe_pos - have_delta 64 (want_delta 64)
	fetch(probe_pos = 6821970908, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970780 = probe_pos - have_delta 128 (want_delta 128)
	fetch(probe_pos = 6821970780, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970524 = probe_pos - have_delta 256 (want_delta 256)
	fetch(probe_pos = 6821970524, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970012 = probe_pos - have_delta 512 (want_delta 512)
	fetch(probe_pos = 6821970012, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821968988 = probe_pos - have_delta 1024 (want_delta 1024)
	fetch(probe_pos = 6821968988, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821966940 = probe_pos - have_delta 2048 (want_delta 2048)
	fetch(probe_pos = 6821966940, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821962844 = probe_pos - have_delta 4096 (want_delta 4096)
	fetch(probe_pos = 6821962844, target_pos = 6821971036)
	 = 6821962845..6821962848
	found_low = true, lower_bound = 6821962845
	lower_bound = high_pos 6821962848
	loop: lower_bound 6821962848, probe_pos 6821966942, upper_bound 6821971036
	fetch(probe_pos = 6821966942, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821966942
	loop: lower_bound 6821962848, probe_pos 6821964895, upper_bound 6821966942
	fetch(probe_pos = 6821964895, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821964895
	loop: lower_bound 6821962848, probe_pos 6821963871, upper_bound 6821964895
	fetch(probe_pos = 6821963871, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821963871
	loop: lower_bound 6821962848, probe_pos 6821963359, upper_bound 6821963871
	fetch(probe_pos = 6821963359, target_pos = 6821971036)
	 = 6821963411..6821963422
	lower_bound = high_pos 6821963422
	loop: lower_bound 6821963422, probe_pos 6821963646, upper_bound 6821963871
	fetch(probe_pos = 6821963646, target_pos = 6821971036)
	 = 6822575316..6822575316

Here, we found nothing between 6821963646 and 6822575316, so upper_bound is reduced
to 6821963646...

	upper_bound = probe_pos 6821963646
	loop: lower_bound 6821963422, probe_pos 6821963534, upper_bound 6821963646
	fetch(probe_pos = 6821963534, target_pos = 6821971036)
	 = 6821963536..6821963539
	lower_bound = high_pos 6821963539
	loop: lower_bound 6821963539, probe_pos 6821963592, upper_bound 6821963646
	fetch(probe_pos = 6821963592, target_pos = 6821971036)
	 = 6821963835..6821963841

...but here, we found 6821963835 and 6821963841, which are between
6821963646 and 6822575316.  They were not there before, so the binary
search result is now invalid because new extent items were added while
it was running.  This results in an exception:

	lower_bound = high_pos 6821963841
	--- BEGIN TRACE --- exception ---
	objectid = 27942759813120, adjusted to 27942793363456 at bees-roots.cc:1103
	Crawling extent BeesCrawlState 250:0 offset 0x0 transid 1311734..1311735 at bees-roots.cc:991
	get_state_end at bees-roots.cc:988
	find_next_extent 250 at bees-roots.cc:929
	---  END  TRACE --- exception ---
	*** EXCEPTION ***
	exception type std::out_of_range: lower_bound = 6821963841, upper_bound = 6821963646 failed constraint check (lower_bound <= upper_bound) at ../include/crucible/seeker.h:139

The exception prevents seek_backward from returning a value, which
prevents a consumer from acting on a nonsense result.

Copy the details of this search into a test case.  Note that the test
case won't reproduce the exception because the simulation of fetch()
is not changing the results part way through.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
337bbffac1 extent scan: drop a nonsense trace message
This message appears only during exception backtraces, but it doesn't
carry any useful information.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
527396e5cb extent scan: integrate seeker debug output stream
Send both tree_search ioctl and `seek_backward` debug logs to the
same output stream, but only write that stream to the debug log if
there is an exception.

The feature remains disabled at compile time.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
bc7c35aa2d extent scan: only write a detailed debug log when there's an exception
Note that when enabled, the logs are still very CPU-intensive,
but most of the logs will be discarded.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
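The buffer-then-discard pattern these two commits describe can be sketched as a small RAII helper (illustrative only; bees' actual debug stream plumbing differs): messages accumulate in a per-operation buffer, and reach the log only if the scope unwinds via an exception.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>

struct DebugCapture {
	std::ostringstream buf;            // detailed (CPU-intensive) messages
	std::string       *log = nullptr;  // destination for kept output
	~DebugCapture() {
		// Keep the details only when an exception is unwinding this
		// scope; on the happy path the buffered text is discarded.
		if (log && std::uncaught_exceptions() > 0)
			*log += buf.str();
	}
};
```

Formatting the messages still costs CPU even when they are discarded, which matches the caveat in the commit message above.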
Zygo Blaxell
0953160584 trace: export exception_check
We need to call this from more than one place in bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
80f9c147f7 btrfs-tree: clean up the fetch function's return set
Commit d32f31f411 ("btrfs-tree: harden
`rlower_bound` against exceptional objects") passes the first btrfs item
in the result set that is above upper_bound up to `seek_backward`.
This is somewhat wasteful as `seek_backward` cannot use such a result.

Reverse that change in behavior, while keeping the rest of the
earlier commit.

This introduces a new case: the search ioctl produces items that are
all above the upper bound, so the result set stays empty and the loop
would continue until the end of the filesystem is reached.  Handle that
by setting an explicit exit variable.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
50e012ad6d seeker: add a runtime debug stream
This allows detailed but selective debugging when using the library,
particularly when something goes wrong.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
9a9644659c trace: clean up the formatting around top-level exception log messages
Fewer newlines.  More consistent application of the "TRACE:" prefix.
All at the same log level.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
fd53bff959 extent scan: drop out-of-date comment
The comment describes an earlier version which submitted each extent
ref as a separate Task, but now all extent refs are handled by the same
Task to minimize the amount of time between processing the first and
last reference to an extent.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
9439dad93a extent scan: extra check to make sure no Tasks are started when throttled
Previously `scan()` would run the extent scan loop once, and enqueue one
extent, before checking for throttling.  Do an extra check before that,
and bail out so that zero extents are enqueued when throttled.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
ef9b4b3a50 extent scan: shorten task name for extent map
Linux kernel thread names are hardcoded at 16 characters.  Every character
counts, and "0x" wastes two.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
7ca857dff0 docs: add the ghost subvols bug to the bugs list
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
8331f70db7 progress: fix ETA calculations
The "tm_left" field was the estimated _total_ duration of the crawl,
not the amount of time remaining.  The ETA timestamp was then calculated
based on the estimated time to run the crawl if it started _now_, not
at the start timestamp.

Fix the duration and ETA calculations.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
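The corrected arithmetic can be sketched as follows (field and function names are illustrative, not bees' actual progress code; `fraction_done` is assumed to be nonzero):

```cpp
#include <cassert>
#include <cstdint>

struct Eta {
	uint64_t seconds_left;    // estimated time remaining
	uint64_t eta_timestamp;   // estimated completion time
};

Eta compute_eta(uint64_t start_ts, uint64_t now_ts, double fraction_done)
{
	const uint64_t elapsed = now_ts - start_ts;
	// Estimated total crawl duration, extrapolated from progress so far.
	const uint64_t total = static_cast<uint64_t>(elapsed / fraction_done);
	// The bug: reporting `total` as the time left, and `now + total` as
	// the ETA.  The fix: time left is total minus elapsed, and the ETA
	// counts from the crawl's start timestamp.
	return Eta{ total - elapsed, start_ts + total };
}
```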
Steven Allen
a844024395 Make the runtime directory private
The status file contains sensitive information like filenames and
duplicate chunk ranges.  It might also make sense to set the
process-wide `UMask=`, but that may have other unintended side effects.
2025-03-26 15:02:42 +00:00
Zygo Blaxell
47243aef14 hash: handle $BEESHOME on btrfs too
The `_nothrow` variants of `do_ioctl` return true when they succeed,
which is the opposite of what `ioctl` does.

Fix the logic so bees can correctly identify its own hash table when
it's on the same filesystem as the target.

Fixes: f6908420ad ("hash: handle $BEESHOME on non-btrfs")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-17 21:18:08 -05:00
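The polarity mismatch is easy to see in a simplified wrapper (a sketch of the convention, not bees' actual templated helper): `ioctl(2)` signals failure with -1, while the `_nothrow` variant returns a bool that is true on success.

```cpp
#include <cassert>
#include <sys/ioctl.h>

// true == success, the opposite of ioctl(2)'s 0-or-negative convention.
// A call site written as `if (do_ioctl_nothrow(...))` tests for success,
// whereas `if (ioctl(...))` tests for failure.
bool do_ioctl_nothrow(int fd, unsigned long request, void *arg)
{
	return ioctl(fd, request, arg) >= 0;
}
```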
Zygo Blaxell
a670aa5a71 extent scan: don't divide by zero if there were no loops
Commit 183b6a5361 ("extent scan: refactor
BeesCrawl, BeesScanMode*") moved some statistics calculations out of
the loop in `find_next_extent`, but did not ensure that the statistics
would not be calculated if the loop had not executed any iterations.

In rare instances, the function returns without entering the loop at
all, which results in a division by zero.  Add a check so the statistics
are skipped in that case.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
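The shape of the guard looks like this (names are illustrative, not the actual statistics fields in `find_next_extent`):

```cpp
#include <cassert>
#include <cstdint>

// Per-loop statistics are only meaningful if the loop ran at least once;
// a zero iteration count would otherwise divide by zero.
uint64_t avg_search_time_us(uint64_t total_time_us, uint64_t loop_count)
{
	if (loop_count == 0)
		return 0;   // bail out before dividing: the loop never ran
	return total_time_us / loop_count;
}
```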
Zygo Blaxell
51b3bcdbe4 trace: deprecate BEESLOGTRACE, align trace logs with exception notices
Exceptions were logged at level NOTICE while the stack traces were logged
at level DEBUG.  That produced useless noise in the output with `-v5`
or `-v6`, where there were exception headings logged, but no details.

Fix that by placing the exceptions and traces at level DEBUG, but prefix
them with `TRACE:` for easy grepping.

Most of the events associated with BEESLOGTRACE either never happen,
or they are harmless (e.g. trying to open deleted files or subvols).
Reassign them to ordinary BEESLOGDEBUG, with one exception for
unrecognized Extent flags that should be debugged if any appear.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
ae58401d53 trace: avoid one copy in every trace function
While investigating https://github.com/Zygo/bees/issues/282 I noticed that
we're doing at least one unnecessary extra copy of the functor in BEESTRACE.
Get rid of it with a const reference.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
3e7eb43b51 BeesStringFile: figure out when to call--or _not_ call--fsync
Older kernel versions featured some bugs in btrfs `fsync`, which could
leave behind "ghost dirents", orphan filename items that did not have
a corresponding inode.  These dirents were created during log replay
during the first mount after a crash due to several different bugs in
the log tree and its use over the years.  The last known bug of this
kind was fixed in kernel 5.16.  As of this writing, no fixes for this
bug have been backported to any earlier LTS kernel.

Some filesystems, including btrfs, will flush the contents of a new
file before renaming it over an old file.  On paper, btrfs can do this
very cheaply since the contents of the new file are not referenced, and
the old file not dereferenced, until a tree commit which includes both
actions atomically; however, in real life, btrfs provides `fsync`-like
semantics and uses the log-tree infrastructure to implement them, which
compromises performance and acts as a magnet for bugs.

The benefit of this trade-off is that `rename` can be used as a
synchronization point for data outside of the btrfs, which would not
happen if everything `rename` does was simply deferred to the next
tree commit.  The cost of this trade-off is that for the first 8 years
of its existence, bees would trigger the bug so often that the project
recommended its users put $BEESHOME in its own subvol to make it easy
to remove ghost dirents left behind by the bug.

Some other filesystems, such as xfs, don't have any special semantics for
`rename`, and require `fsync` to avoid garbage or missing data after
a crash.  Even filesystems which do have a special case for `rename`
can be configured to turn it off.

btrfs will silently delete data from files in the event that an
unrecoverable data block write error occurs.  Kernel version 6.2 adds
important new and unexpected cases where this can happen on filesystems
using raid56 data, but it also happens in all usable btrfs versions
(the silent deletion behavior was introduced in kernel version 3.9).

Unrecoverable write errors are currently reported to userspace only
through `fsync`.  Since the failed extents are deleted, they cannot be
detected via csum failures or scrub after the fact--and it's too late
by then, the data is already gone.  `fsync` is the last opportunity
to detect the write failure before the `rename`.  If the error is not
detected, the contents of the file will be silently discarded in btrfs.
The impact on bees is that scans will abruptly restart from zero after
a crash combined with some other reasonably common failures.

Putting all of this together leads to a rather complex workaround:
if the filesystem under $BEESHOME (specifically, the filesystem where
BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs
filesystem, and the host kernel is a version prior to 5.16, then don't
call `fsync` before `rename`.  In all other cases, do call `fsync`,
and prevent dependent writes (i.e. the following `rename`) in the event
of errors.

Since present kernel versions still require `fsync`, we don't need
an upper bound on the kernel version check until someone fixes btrfs
`rename` (or perhaps adds a flag to `renameat2` which prevents use of
the log tree) in the kernel.  Once that fix happens, we can drop the
`fsync` call for kernels after that fixed version.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:04:20 -05:00
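The write pattern at the heart of this commit can be sketched as follows (a hedged illustration, not bees' actual `BeesStringFile` code; in the real workaround `want_fsync` would be false only for btrfs on kernels before 5.16):

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Write a temporary file, optionally fsync it, then rename it over the
// old file.  An fsync failure suppresses the rename, so a file whose
// contents btrfs would silently discard never replaces the old copy.
void write_file_durable(const std::string &dir, const std::string &name,
                        const std::string &contents, bool want_fsync)
{
	const std::string tmp = dir + "/" + name + ".tmp";
	const std::string final_path = dir + "/" + name;
	int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) throw std::runtime_error("open failed");
	if (write(fd, contents.data(), contents.size())
			!= static_cast<ssize_t>(contents.size())) {
		close(fd);
		throw std::runtime_error("short write");
	}
	// fsync is the last opportunity to detect an unrecoverable write
	// error before rename makes the result permanent.
	if (want_fsync && fsync(fd) != 0) {
		close(fd);
		throw std::runtime_error("fsync failed");  // no rename
	}
	close(fd);
	if (rename(tmp.c_str(), final_path.c_str()) != 0)
		throw std::runtime_error("rename failed");
}
```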
Zygo Blaxell
962d94567c hexdump: fix pointer cast const mismatch
Another hit from the exotic compiler collection:  build fails on GCC 9,
from Ubuntu 20...but not later versions of GCC.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:00:31 -05:00
Zygo Blaxell
6dbef5f27b fs: improve compatibility with linux-libc-dev 5.4
Fix the missing symbols that popped up when adding chunk tree to
lib/fs.cc.  Also define the missing symbols instead of merely trying to
avoid them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-08 21:17:15 -05:00
Zygo Blaxell
88b1e4ca6e main: unconditionally enable workaround for the logical_ino-vs-clone kernel bug
This obviously doesn't fix or prevent the kernel bug, but it does prevent
bees from triggering the bug without assistance from another application.

The bug can still be triggered by running bees at the same time as an
application which uses clone or LOGICAL_INO.  `btdu` uses LOGICAL_INO,
while `cp` from coreutils (and many others) use clone (reflink copy).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
c1d7fa13a5 roots: drop unnecessary mutex unlock in stop_request
In commit 31b2aa3c0d ("context: speed
up orderly process termination"), the stop request was split into two
methods after the mutex unlock.

Now that there's nothing after the mutex unlock in `stop_request`,
there's no need for an explicit unlock to do what the destructor would
have done anyway.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
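The RAII reasoning here can be shown with a minimal sketch (hypothetical signature; the real `stop_request` also signals a condition variable):

```cpp
#include <cassert>
#include <mutex>

void stop_request(std::mutex &m, bool &stop_requested)
{
	const std::lock_guard<std::mutex> lock(m);
	stop_requested = true;
	// No explicit unlock: the guard's destructor releases the mutex at
	// end of scope, and nothing else runs after the protected update.
}
```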
Zygo Blaxell
aa39bddb2d extent scan: implement an experimental ordered scan mode
Parallel scan runs each extent size tier in a separate thread.  The
threads compete to process extents within the tier's size range.

Ordered scan processes each extent size tier completely before moving on
to the next.  In theory, this means large extents always get processed
quickly, especially when new ones appear, and the queue does not fill up
with small extents.

In practice, the multi-threaded scanner massively outperforms the
single-threaded scanner, unless the number of worker threads is very
small (i.e. one).

Disable most of the feature for now, but leave the code in place so it
can be easily reactivated for future testing.

Ordered scan introduces a parallelized extent mapper Task.  Keep that in
parallel scan mode, which further enhances the parallelism.  The extent
scan crawl threads now run at 'idle' priority while the map tasks run
at normal priority, so the map tasks don't flood the task queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
1aea2d2f96 crawl: deprecate use of BeesCrawl to search the extent tree
BeesScanModeExtent can do that by itself now.  Overloading the subvol
crawl code resulted in an ugly, inefficient hack, and we definitely
don't want to accidentally continue to use it.

Remove the support for reading the extent tree and add some `assert`s
to make sure it isn't still used somewhere.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
673b450671 docs: update event counters after extent scan refactoring and crawl skipping
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
183b6a5361 extent scan: refactor BeesCrawl, BeesScanMode*
The main gains here are:

* Move extent tree searches into BeesScanModeExtent so that they are
not slowed down by the BeesCrawl code, which was designed for the
much more specialized metadata in subvol trees.
* Enable short extent skipping now that BeesCrawl is out of the way.
* Stop enumerating btrfs subvols when in extent scan mode.

All this gets rid of >99% of unnecessary extent tree searches.
Incremental extent scan cycles now finish in milliseconds instead
of minutes.

BeesCrawl was never designed to cope with the structure and content of
the extent tree.  It would waste thousands of tree-search ioctl calls
reading and ignoring metadata items.

Performance was particularly bad when a binary search was involved, as any
binary search probe that landed in a metadata block group would read and
discard all the metadata items in the block group, sequentially, repeated
for each level of the binary search.  This was blocking implementation of
short extent skipping optimization for large extent size tiers, because
the skips were using thousands of tree searches to skip over only a few
hundred extent items.

Extent scan also had to read every extent item twice to do the
transid filtering, because BeesCrawl's interface discarded the relevant
information when it converted a `BtrfsTreeItem` into a `BeesFileRange`.
The cost of this extra fetch was negligible, but it could have been zero.

Fix this by:

* Copy the equivalent of `fetch_extents` from BeesCrawl into
`BeesScanModeExtent`, then give each of the extent scan crawlers its
own `BtrfsDataExtentTreeFetcher` instance.  This enables extent tree
searches to avoid pure (non-mixed) metadata block groups.  `BeesCrawl`
is now used only for its interface to `BeesRoots` for saving state in
`beescrawl.dat`, and never to determine the next extent tree item.

* Move subvol-specific parts of `BeesRoots` into a new class
`BeesScanModeSubvol` so that `BeesScanModeExtent` doesn't have to enable
or support them.  In particular, `bees -m4` no longer enumerates all
of the _subvol_ crawlers.  `BeesRoots` is still used to save and load
crawl state.

* Move several members from `BeesScanModeExtent` into a per-crawler
state object `SizeTier` to eliminate the need for some locks and to
maintain separate cache state for `BtrfsDataExtentTreeFetcher`.

* Reuse the `BtrfsTreeItem` to get the generation field for the transid
range filter.

* Avoid a few corner cases when handling errors, where extent scan might
drop an extent without scanning it, or fail to advance to the next extent.

* Enable the extent-skipping algorithm for large size tiers, now that
`BeesCrawl::fetch_extents` is no longer slowing it down.

* Add a debug stream interface which developers can easily turn on when
needed to inspect the decisions that extent scan is making.

* Track metrics that are more useful, particularly searches per extent
scanned, and fraction of extents that are skipped.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
b6446d7316 roots: rework open_root_nocache to use btrfs-tree
This gets rid of one open-coded btrfs tree search.

Also reduce the log noise level for subvol open failures, and remove
some ancient references to `BEESLOG`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
d32f31f411 btrfs-tree: harden rlower_bound against exceptional objects
Rearrange the logic in `rlower_bound` so it can cope with a tree
that contains mostly block-aligned objects, with a few exceptions
filtered out by `hdr_stop`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
dd08f6379f btrfs-tree: add a method to get root backref items to BtrfsRootFetcher
This complements the already existing support for reading the fields of
a root backref.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
58ee297cde btrfs-tree: connect methods to the debug stream interface
In some cases functions already had existing debug stream support
which can be redirected to the new interface.  In other cases, new
debug messages are added.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
a3c0ba0d69 fs: add a runtime debug stream for btrfs tree searches
This allows plugging in an ostream at run time so that we can audit all
the search calls we are doing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
75040789c6 btrfs-tree: drop BtrfsFsTreeFetcher and clean up class comments
BtrfsFsTreeFetcher was used for early versions of the extent scanner, but
neither subvol nor extent scan now needs an object that is both persistent
and configured to access only one subvol.  BtrfsExtentDataFetcher does
the same thing in that case.

Clarify the comments on what the remaining classes do, so that
BtrfsFsTreeFetcher doesn't get inadvertently reinvented in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f9a697518d btrfs-tree: introduce BtrfsDataExtentTreeFetcher to read data extents without metadata
Binary searches can be extremely slow if the target bytenr is near a
metadata block group, because metadata items are not visible to the
binary search algorithm.  In a non-mixed-bg filesystem, there can be
hundreds of thousands of metadata items between data extent items, and
since the binary search algorithm can't see them, it will run searches
that iterate over hundreds of thousands of objects about a dozen times.

This is less of a problem for mixed-bg filesystems because the data and
metadata blocks are not isolated from each other.  The binary search
algorithm still can't see the metadata items, but there are usually
some data items close by to prevent the linear item filter from running
too long.

Introduce a new fetcher class (all the good names were taken) that tracks
where the end of the current block group is.  When the end of the current
block group is reached in the linear search, skip ahead to a block group
that can contain data items.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
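The skip-ahead idea can be modeled with a self-contained sketch (a simplified illustration, not the real `BtrfsDataExtentTreeFetcher`): given the block group layout, a scan position that crosses into a metadata-only block group jumps straight to the next block group that can hold data items.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct BlockGroup {
	uint64_t start, end;   // [start, end) in bytenr space
	bool     holds_data;   // false for pure (non-mixed) metadata groups
};

// groups must be sorted by start.  Returns pos if it already lies in a
// data-capable block group, the start of the next such group otherwise,
// or UINT64_MAX when no data block group remains.
uint64_t next_data_pos(uint64_t pos, const std::vector<BlockGroup> &groups)
{
	for (const auto &bg : groups) {
		if (bg.end <= pos || !bg.holds_data)
			continue;                            // behind us, or metadata-only
		return pos >= bg.start ? pos : bg.start; // inside, or skip ahead
	}
	return UINT64_MAX;
}
```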
Zygo Blaxell
c4ba6ec269 fs: add a ntoa function for chunk types
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
440740201a main: the base directory for --strip-paths should be root_fd, not cwd
The cwd is where core dumps and various profiling and verification
libraries want to write their data, whereas root_fd is the root of the
target filesystem.  These are often intentionally different.  When
they are different, `--strip-paths` sets the wrong prefix to strip
from paths.

Once the root fd has been established, we can set the path prefix to
the string prefix that we'll get from future calls to `name_fd`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f6908420ad hash: handle $BEESHOME on non-btrfs
bees explicitly supports storing $BEESHOME on another filesystem, and
does not require that filesystem to be btrfs; however, if $BEESHOME
is on a non-btrfs filesystem, there is an exception on every startup
when trying to identify the subvol root of the hash table file in order
to blacklist it, because non-btrfs filesystems don't have subvol roots.

Fix by checking not only whether $BEESHOME is on btrfs, but whether it
is on the _same_ btrfs, as the bees root, without throwing an exception.
The hash table is blacklisted only when both filesystems are btrfs and
have the same fsid.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
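The blacklist decision reduces to a comparison like the following sketch (types and names are hypothetical; in bees the fsid would come from a btrfs ioctl, and a probe on a non-btrfs filesystem yields no fsid at all):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

using Fsid = std::array<uint8_t, 16>;

// Blacklist the hash table only when both $BEESHOME and the bees root are
// on btrfs (both probes produced an fsid) and the fsids match.
bool should_blacklist(const std::optional<Fsid> &beeshome_fsid,
                      const std::optional<Fsid> &root_fsid)
{
	return beeshome_fsid && root_fsid && *beeshome_fsid == *root_fsid;
}
```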
Zygo Blaxell
925b12823e fs: add do_ioctl_nothrow and fsid methods to btrfs fs info
Enable use of the ioctl to probe whether two fds refer to the same btrfs,
without throwing an exception.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
561e604edc seeker: turn off debug logging
The debug log is only revealed when something goes wrong, but it is
created and discarded every time `seek_backward` is called, and it
is quite CPU-intensive.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
30cd375d03 readahead: clean up the code, update docs
Remove dubious comments and #if 0 section.  Document new event counters,
and add one for read failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
48b7fbda9c progress: adjust minimum thresholds for ETA to 10 seconds and 1 GiB of data
1% is a lot of data on a petabyte filesystem, and a long time to wait for an
ETA.

After 1 GiB we should have some idea of how fast we're reading the data.
Increase the time to 10 seconds to avoid a nonsense result just after a scan
starts.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
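The thresholds amount to a simple gate (names are illustrative, not the actual progress fields):

```cpp
#include <cassert>
#include <cstdint>

// Only show an ETA once at least 10 seconds have elapsed and at least
// 1 GiB has been scanned, so a freshly started scan doesn't report a
// nonsense estimate.
bool eta_is_meaningful(uint64_t elapsed_seconds, uint64_t bytes_scanned)
{
	constexpr uint64_t min_seconds = 10;
	constexpr uint64_t min_bytes   = 1ULL << 30;   // 1 GiB
	return elapsed_seconds >= min_seconds && bytes_scanned >= min_bytes;
}
```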
Zygo Blaxell
85aba7b695 openat2: #include <linux/types.h> so we can know __u64
Alternative implementations could use `uint64_t` instead, from `cstdint`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 17:02:19 -05:00
Zygo Blaxell
de38b46dd8 scripts/beesd: harden the mount options
 * `nodev`: This reduces rename attack surface by preventing bees from
 opening any device file on the target filesystem.

 * `noexec`: This prevents access to the mount point from being leveraged
 to execute setuid binaries, or execute anything at all through the
 mount point.

These options are not required because they duplicate features in the
bees binary (assuming that the mount namespace remains private):

 * `noatime`: bees always opens every file with `O_NOATIME`, making
 this option redundant.

 * `nosymfollow`: bees uses `openat2` on kernels 5.6 and later with
 flags that prevent symlink attacks.  `nosymfollow` was introduced in
 kernel 5.10, so every kernel that can do `nosymfollow` can already do
 `openat2`.  Also, historically, `$BEESHOME` can be a relative path with
 symlinks in any path component except the last one, and `nosymfollow`
 doesn't allow that.

Between `openat2` and `nodev`, all symlink attacks are prevented, and
rename attacks cannot be used to force bees to open a device file.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 01:00:41 -05:00
Zygo Blaxell
0abf6ebb3d scripts/beesd: no need for $BEESHOME to be a subvol
We _recommend_ that `$BEESHOME` should be a subvol, and we'll create a
subvol if no directory exists; however, there's no reason to reject an
existing plain directory if the user chooses to use one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 00:43:13 -05:00
Kai Krakow
360ce7e125 scripts/beesd: Unshare namespace without systemd
If starting the beesd script without systemd, the mount point won't
automatically unmount if the script is cancelled with ctrl+c.

Fixes: https://github.com/Zygo/bees/issues/281
Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-20 00:05:57 -05:00
Zygo Blaxell
ad11db2ee1 openat2: supply the missing definitions for building with old headers and new kernel
Apparently Ubuntu 20 has upgraded to kernel 5.15, but still builds things
with 5.4 headers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:20:06 -05:00
Zygo Blaxell
874832dc58 openat2: log a warning when we fall back to openat
This should occur only once per run, but it's worth leaving a note
that it has happened.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
Zygo Blaxell
5fe89d85c3 extent scan: make sure we run every extent crawler once per transaction
There's a pathological case where all of the extent scan crawlers except
one are at the end of a crawl cycle, but the one crawler that is still
running is keeping the Task queue full.  The result is that bees never
starts the other extent scan crawlers, because the queue is always
full at the instant a new transid triggers the start of a new scan.
That's bad because it will result in bees falling behind when new data
from the inactive size tiers appears.

To fix this, check for throttling _after_ creating at least one scan task
in each crawler.  That will keep the crawlers running, and possibly allow
them to claw back some space in the Task queue.  It slightly overcommits
the Task queue, so there will be a few more Tasks than nominally allowed.

Also (re)introduce some hysteresis in the queue size limit and reduce it
a little, so that bees isn't continually stopping and restarting crawls
every time one task is created or completed, and so that we stay under
the configured Task limit despite overcommitting.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
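The hysteresis described in this commit can be sketched as a two-threshold gate (a minimal illustration with invented names, not bees's actual code):

```cpp
#include <cassert>
#include <cstddef>

// Two-threshold gate: stop creating Tasks at the high-water mark, and
// don't resume until the queue drains below the low-water mark, so a
// single Task being created or completed near the limit doesn't toggle
// the crawlers between stopped and started on every event.
class ThrottleGate {
    size_t m_high;              // stop creating Tasks at or above this
    size_t m_low;               // resume creating Tasks below this
    bool   m_throttled = false;
public:
    ThrottleGate(size_t high, size_t low) : m_high(high), m_low(low) {}
    // Returns true when a crawler should pause Task creation.
    bool should_throttle(size_t queue_size) {
        if (m_throttled)
            m_throttled = queue_size >= m_low;   // stay stopped until drained
        else
            m_throttled = queue_size >= m_high;  // stop at the high mark
        return m_throttled;
    }
};
```

The gap between the two marks is what lets each crawler create at least one Task per cycle without continually crossing a single limit.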
Zygo Blaxell
a2b3e1e0c2 log: demote a lot of BEESLOGWARN to higher verbosity levels
Toxic extent workarounds are going away because the underlying kernel
bugs have been fixed.  They are no longer worthy of spamming non-developer
logs.

INO_PATHS can return no paths if an inode has been deleted.  It doesn't
need a log message at all, much less one at WARN level.

Dedupe failure can be INFO, the same level as dedupe itself, especially
since the "NO dedupe" message doesn't mention what was [not] deduped.

Inspired by Kai Krakow's "context: demote "abandoned toxic match" to
debug log level".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 01:08:28 -05:00
Kai Krakow
aaec931081 context: demote "abandoned toxic match" to debug log level
This log message produces an overwhelming number of messages in the system
journal, leading to write-back flushing storms under high activity.  As
it is a work-around message, it is probably only useful to developers,
thus demote to debug level.

This fixes latency spikes in desktop usage after adding a lot of new
files, especially since systemd-journal starts to flush caches if it
sees memory pressure.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-19 00:59:22 -05:00
Zygo Blaxell
c53fa04a2f task: fixes for priority and idle Tasks
Tasks are not allowed to be queued more than once, but it is allowed
to queue a Task while it's already running, which means a Task can be
executed on two threads in parallel.  Tasks detect this and handle it
by queueing the Task on its own post-exec queue.  That in turn leads
to Workers which continually execute the same Task if that Task doesn't
create any new Tasks, while other Tasks sit on the Master queue waiting
for a Worker to dequeue them.

For idle Tasks, we don't want the Task to be rescheduled immediately.
We want the idle Task to execute again after every available Task on
both the main and idle queues has been executed.

Fix these by having each Task reschedule itself on the appropriate
queue when it finishes executing.

Priority-queued Tasks should be executed in priority order across not
just one Task's post-exec queue, but the entire local queue of the
TaskConsumer.

Fix this by moving the sort into either the TaskConsumer that receives
a post-exec queue, if there is one, or into the Task that is created
to insert the post-exec queue into a TaskConsumer when one becomes
available in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-15 00:43:25 -05:00
Zygo Blaxell
d4a681c8a2 Revert "roots: use a non-idle task for next_transid"
next_transid tasks don't respect queue selection very well, because
they effectively end up spinning in a loop until all other worker
threads become busy.

Back this out, and fix the priority handling in the Task library.

This reverts commit 58db4071de.
2025-01-12 18:48:33 -05:00
Zygo Blaxell
a819d623f7 task: do not allow queue loops in priority queueing mode
Tasks using non-priority FIFO dependency tracking can insert themselves
into their own queue, to run the Task again immediately after it exits.

For priority queues, this attempts to splice the post-exec queue into
itself, which doesn't seem like a good idea.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 15:28:26 -05:00
Zygo Blaxell
de9d72da80 task: flatten queues of dependent Tasks
Suppose Task A, B, and C are created in that order, and currently running.
Task T acquires Exclusion E.  Task B, A, and C attempt to acquire the
same Exclusion, in that order, but fail because Task T holds it.

The result is Task T with a post-exec queue:

        T, [ B, A, C ]  sort_requested

Now suppose Task U acquires Exclusion F, then Task T attempts to acquire
Exclusion F.  Task T fails to acquire F, so T is inserted into U's
post-exec queue.  The result at the end of the execution of T is a tree:

        U, [ T ]  sort_requested
             \-> [ B, A, C ] sort_requested

Task T exits after failing to acquire a lock.  When T exits, T will
sort its post-exec queue and submit the post-exec queue for execution
immediately:

        Worker 1: U, [ T ]  sort_requested
        Worker 2: A, B, C

This isn't ideal because T, A, B, and C all depend on at least one
common Exclusion, so they are likely to immediately conflict with T
when U exits and T runs again.

Ideally, A, B, and C would at least remain in a common queue with T,
and ideally that queue is sorted.

Instead of inserting T into U's post-exec queue, insert T and all
of T's post-exec queue, which creates a single flattened Task list:

        U, [ T, B, A, C ]   sort_requested

Then when U exits, it will sort [ T, B, A, C ] into [ A, B, C, T ],
and run all of the queued Tasks in age priority order:

        U exited, [ T, B, A, C ]   sort_requested

        U exited, [ A, B, C, T ]

        [ A, B, C, T ] on TaskConsumer queue

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 14:05:44 -05:00
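The flattening step above maps directly onto `std::list::splice` (Task names stand in for Task objects here; this is a sketch, not the Task library's real code):

```cpp
#include <cassert>
#include <list>
#include <string>

// Instead of inserting only the blocked Task T into U's post-exec queue,
// splice T's own post-exec queue in behind it, producing one flat list
// that can be sorted as a whole when U exits.
using TaskQueue = std::list<std::string>;

void enqueue_flattened(TaskQueue &u_queue, std::string t, TaskQueue &t_queue) {
    u_queue.push_back(std::move(t));         // U, [ T ]
    u_queue.splice(u_queue.end(), t_queue);  // U, [ T, B, A, C ]
}
```

`splice` moves list nodes without copying, which is one of the `std::list` properties the Task library relies on.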
Zygo Blaxell
74d8bdd60f task: add an insert method for priority-queueing Tasks by age
Task started out as a self-organizing parallel-make algorithm, but ended
up becoming a half-broken wait-die algorithm.  When a contended object
is already locked, Tasks enter a FIFO queue to restart and acquire the
lock.  This is the "die" part of wait-die (all locks on an Exclusion are
non-blocking, so no Task ever does "wait").  The lock queue is FIFO wrt
_lock acquisition order_, not _Task age_ as required by the wait-die
algorithm.

Make it a 25%-broken wait-die algorithm by sorting the Tasks on lock
queues in order of Task ID, i.e. oldest-first, or FIFO wrt Task age.
This ensures the oldest Task waiting for an object is the one to get
it when it becomes available, as expected from the wait-die algorithm.

This should reduce the amount of time Tasks spend on the execution queue,
and reduce memory usage by avoiding the accumulation of Tasks that cannot
make forward progress.

Note that turning `TaskQueue` into an ordered container would have
undesirable side-effects:

 * `std::list` has some useful properties wrt stability of object
 location and cost of splicing.  Other containers may not have these,
 and `std::list` does have a `sort` method.

 * Some Task objects are created at the beginning and reused continually,
 but we really do want those Tasks to be executed in FIFO order wrt
submission, not Task ID.  We can exclude these Tasks by only doing the
sorting when a Task is queued for an Exclusion object.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 00:35:37 -05:00
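A minimal sketch of the oldest-first ordering, assuming monotonically increasing Task IDs (illustrative, not the Task library's actual types):

```cpp
#include <cassert>
#include <list>

// Task IDs are assigned at creation, so sorting a lock queue by ID
// ascending is FIFO with respect to Task age, as wait-die requires.
struct Task { unsigned id; };

void sort_oldest_first(std::list<Task> &queue) {
    // std::list::sort is stable and splices nodes instead of copying,
    // so queued Tasks are reordered without invalidating references.
    queue.sort([](const Task &a, const Task &b) { return a.id < b.id; });
}
```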
Zygo Blaxell
a5d078d48b docs: deprecate the --workaround-btrfs-send option
Emphasize that the option is relevant to old kernels, older than the
minimum supportable version threshold.

De-emphasize the use case of "send-workaround" as a synonym for "exclude
read-only".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
e2587cae9b docs: expand "Threads and load management" to suggest not running bees so much
One of the more obvious ways to reduce bees load is to simply not run
it all the time.  Explicitly state using maintenance windows as a load
management option.

SIGUSR1 and SIGUSR2 should have been documented somewhere else before now.
Better late than never.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
ac581273d3 docs: config.md updates
The theories behind bees slowing down when presented with a larger hash
table turned out to be wrong.  The real cause was a very old bug which
submitted thousands of `LOGICAL_INO` requests when only a handful of
requests were needed.

"Compression on the filesystem" -> "Compression in files"

Don't be so "dramatic".  Be "rapid" instead.

Remove "cannot avoid modifying read-only snapshots" as a distinction
between subvol and extent scans.  Both modes support send workaround
and send waiting with no significant distinction.

Emphasize extent scan's better handling of many snapshots.  Also reflinks.

Add some discussion of `--throttle-factor`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
7fcde97b70 docs: update the bug reporting and status instructions
Thread names have changed.  Document some of the newer ones.

Don't jump immediately to blaming poor performance on qgroups or
autodefrag.  These do sometimes have kernel regressions but not all
the time.

Emphasize advantage of controlling bees deferred work requests at the
source, before btrfs gets stuck committing them.

Avoid asserting that it's OK for gdb to crash.

Remove mention of lower-layer block device issues wrt corruption.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
e457f502b7 docs: update kernel bugs page for January 2025
"Kernel" -> "Linux kernel".  If you can run bees on a kernel that isn't
Linux, congratulations!

Emphasize the age of the data corruption warnings.  Once 5.4 reaches
EOL we can remove those.

Simplify the discussion of old kernels and API levels.  There's a
new optional kernel API for `openat2` support at 5.6.  The absolute
minimum kernel version is still 4.2, and will not increase to 4.15
until the subvol scanners are removed.

Remove discussion of bees support for kernels 4.19 (which recently
reached EOL) and earlier.

The `LOGICAL_INO` vs dedupe bug is actually a `LOGICAL_INO` vs clone bug.
Dedupe isn't necessary to reproduce it.

Remove a stray ')'.

Strip out most of the discussion of slow backrefs, as they are no longer a
concern on the range of supported kernel versions.  Leave some description
there because bees still has some vestigial workarounds.

Remove `btrfs send` from the "Unfixed kernel bugs" section, which makes
the section empty, so remove the section too.  bees now handles send on
a subvol reasonably well.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
46815f1a9d docs: update README.md
Emphasize "large" is an upper bound on the size of filesystem bees
can handle.

New strengths:  largest extent first for fixed maintenance windows,
scans data only once (ish), recovers more space

Removed weaknesses:  less temporary space

Need more caps than `CAP_SYS_ADMIN`.

Emphasize DATA CORRUPTION WARNING is an old-kernel thing.

Update copyright year.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
0d251d30f4 docs: update feature interaction lists
Tested on larger filesystems than 100T too, but let's use Fermi
approximation.  Next size is 1P.

Removed interaction with block-level SSD caching subsystems.  These are
really btrfs metadata vs. a lower block layer, and have nothing to do
with bees.

Added mixed block groups to the tested list, as mixed block groups
required explicit support in the extent scanner.

Added btrfs-convert to the tested list.  btrfs-convert has various
problems with space allocation in general, but these can be solved by
carefully ordered balances after conversion, and they have nothing to
do with bees.

In-kernel dedupe is dead and the stubs were removed years ago.  Remove it
from the list.

btrfs send now plays nicely with bees on all supportable kernels, now
that stable/linux-4.19.y is dead.  Send workaround is only needed for
kernels before v5.4 (technically v5.2, but nobody should ever mount a
btrfs with kernel v5.1 to v5.3).  bees will pause automatically when
deduping a subvol that is currently running a send.

bees will no longer gratuitously refragment data that was defragmented
by autodefrag.

Explicitly list all the RAID profiles tested so far, as there have been
some new ones.

Explicitly list other deduplicators tested.

Sort the list of btrfs features alphabetically.

Add scrub and balance, which have been tested with bees since the
beginning.

New tested btrfs features:  block-group-tree, raid1c3, raid1c4.

New untested btrfs features:  squotas, raid-stripe-tree.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
b8dd9a2db0 progress: put a timestamp in the bottom row
This records the time when the progress data was calculated, to help
indicate when the data might be very old.

While we're here, move "now" out of the loop so there's only one value.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
8bc90b743b task: get rid of the insert_task method
Nothing calls it (not even tests), and there's significant functional
overlap with `try_lock`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
2f2a68be3d roots: use openat2 instead of openat when available
This increases resistance to symlink and mount attacks.

Previously, bees could follow a symlink or a mount point in a directory
component of a subvol or file name.  Once the file is opened, the open
file descriptor would be checked to see if its subvol and inode matches
the expected file in the target filesystem.  Files that fail to match
would be immediately closed.

With openat2 resolve flags, symlinks and mount points terminate path
resolution in the kernel.  Paths that lead through symlinks or onto
mount points cannot be opened at all.

Fall back to openat() if openat2() returns ENOSYS, so bees will still
run on kernels before v5.6.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-09 02:26:53 -05:00
Zygo Blaxell
82f1fd8054 process: replace crucible::gettid() with a weak symbol
Since we're now using weak symbols for dodgy libc functions, we might
as well do it for gettid() too.

Use the ::gettid() global namespace and let libc override it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-09 01:37:44 -05:00
Zygo Blaxell
a9b07d7684 openat2: create a weak syscall wrapper for it
openat2 allows closing more TOCTOU holes, but we can only use it when
the kernel supports it.

This should disappear seamlessly when libc implements the function.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-09 01:36:39 -05:00
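The weak wrapper pattern from this commit and the openat2 commits above can be sketched as follows. The local `struct open_how` and flag definition mirror the "missing definitions" approach for pre-5.6 headers, and `open_no_symlinks` is an invented helper name, not bees's API:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Supply definitions locally when building with old headers.
#ifndef RESOLVE_NO_SYMLINKS
struct open_how { uint64_t flags, mode, resolve; };
#define RESOLVE_NO_SYMLINKS 0x04
#endif

// Weak definition: if libc ever provides a strong openat2 symbol,
// libc's version wins and this one disappears seamlessly.
extern "C" __attribute__((weak))
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size)
{
#ifdef SYS_openat2
    return (int)syscall(SYS_openat2, dirfd, pathname, how, size);
#else
    (void)dirfd; (void)pathname; (void)how; (void)size;
    errno = ENOSYS;  // headers too old to know the syscall number
    return -1;
#endif
}

// Open with symlink resolution disabled, falling back to plain openat()
// on kernels before v5.6, which report ENOSYS.
int open_no_symlinks(const char *path)
{
    struct open_how how = {};
    how.flags = O_RDONLY;
    how.resolve = RESOLVE_NO_SYMLINKS;
    int fd = openat2(AT_FDCWD, path, &how, sizeof how);
    if (fd < 0 && errno == ENOSYS)
        fd = openat(AT_FDCWD, path, O_RDONLY);  // pre-5.6 fallback
    return fd;
}
```

With `RESOLVE_NO_SYMLINKS` set, a path that traverses any symlink fails in the kernel instead of being opened and rejected after the fact.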
Zygo Blaxell
613ddc3c71 progress: rename "ctime" -> "tm_left"
"ctime", an abbreviation of "cycle time", collides with "ctime", an
abbreviation of "st_ctime", a well-known filesystem term.

"tm_left" fits in the column, so use that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-06 12:50:50 -05:00
Zygo Blaxell
c3a39b7691 progress: rework the progress table after github discussion
* Report position within cycle in units that cannot be mistaken for size or percentage
* Put the total/maximum values in their own row
* Add a start time column
* Change column titles to reference "cycles"
* Use "idle" instead of "finished" when a crawler is not running
* Replace "transid" with "gen" because it's shorter

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:45:37 -05:00
Zygo Blaxell
58db4071de roots: use a non-idle task for next_transid
The scanners which finish early can become stuck behind scanners that are
able to keep the queue full.  Switch the next_transid task to the normal
Task queues so that we force scanners to restart on every new transaction,
possibly deferring already queued work to do so.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:36:53 -05:00
Zygo Blaxell
0d3e13cc5f context: report time in scan_one_extent
Add yet another field to the scan/skip report line:  the wallclock
time used to process the extent ref.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:36:53 -05:00
Zygo Blaxell
1af5fcdf34 roots: don't access a shared variable after releasing a lock
Access the local copy of `m_root_crawl_map` instead.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:36:53 -05:00
Zygo Blaxell
87472b6086 extent scan: don't put non-data block groups in the data extent map
The total data size should not include metadata or system block groups,
and already does not; however, we still have these block groups in the map
for mapping the crawl pointer to a logical offset within the filesystem.

Rearrange a few lines around the `if` statement so that the map doesn't
contain anything it should not.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:32:48 -05:00
Zygo Blaxell
ca351d389f extent scan: pick the right block groups for mixed-bg filesystems
The progress indicator was failing on a mixed-bg filesystem because those
filesystems have block groups which have both _DATA and _METADATA bits,
and the filesystem size calculation was excluding block groups that have
_METADATA set.  It should exclude block groups that have _DATA not set.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
1f0b8c623c options: improve message when too many--or too few--path arguments given
Running bees with no arguments complains about "Only one" path argument.
Replace this with "Exactly one" which uses similar terminology to other
btrfs tools.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
74296c644a options: return EXIT_SUCCESS after displaying help message
`getopt_long` already supplies a message when an option cannot be parsed,
so there isn't a need to distinguish option parse failures from help
requests.

Fixes: https://github.com/Zygo/bees/pull/277
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
231593bfbc throttle: don't hold the multilock during throttle
Release the lock before entering the throttle sleep, so that other
threads can still run.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
d4900cc5d5 docs: default throttle is zero
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
81bbf7e1d4 throttle: set default to 0.0
Longer latency testing runs are not showing a consistent gain from a
throttle factor of 1.0.  Make the default more conservative.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
bd9dc0229b docs: add --throttle-factor option
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
2a1ed0b455 throttle: track time values more closely
Decaying averages by 10% every 5 minutes gives roughly a half-hour
half-life to the rolling average.  Speed that up to once per minute.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:14:31 -05:00
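The half-life arithmetic here checks out: keeping 90% per period gives a half-life of ln(0.5)/ln(0.9) ≈ 6.6 periods, so 5-minute periods put the half-life near 33 minutes ("roughly a half-hour"), and 1-minute periods bring it down to about 6.6 minutes. A quick check (illustrative, not bees code):

```cpp
#include <cassert>
#include <cmath>

// Half-life of a rolling average that keeps 90% of its value per period.
double half_life_minutes(double period_minutes)
{
    const double periods = std::log(0.5) / std::log(0.9);  // ~6.58
    return periods * period_minutes;
}
```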
Zygo Blaxell
d160edc15a throttle: add --throttle-factor option to control throttling factor
Also change the initializer syntax for the option list to use C99
compound literals.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:13:51 -05:00
Zygo Blaxell
e79b242ce2 options: clean up the parser, prepare for new options with no short form
We're not adding any more short options, but the debugging code doesn't
work with optvals above 255.  Also clean up constness and variable
lifetimes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-16 23:32:18 -05:00
Zygo Blaxell
ea45982293 throttle: add delays to match deferred request rate to btrfs completion rate
Measure the time spent running various operations that extend btrfs
transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe)
and arrange for each operation to run for not less than the average
amount of time by adding a sleep after each operation that takes less
than the average.

The delay after each operation is intended to slow down the rate of
deferred and long-running requests from bees to match the rate at which
btrfs is actually completing them.  This may help avoid big spikes in
latency if btrfs has so many requests queued that it has to force a
commit to release memory.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-16 23:32:18 -05:00
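The delay computation described above can be sketched like this (a simplification with invented names: the decay is applied per update here for brevity, where bees decays its averages on a timer, and the result would also be scaled by `--throttle-factor`):

```cpp
#include <cassert>

// Keep a rolling average of how long each deferred operation
// (LOGICAL_INO, tmpfile, dedupe) takes, and after an operation that
// finished faster than the average, return the extra time to sleep so
// the issue rate matches btrfs's completion rate.
class OpThrottle {
    double m_avg = 0;            // rolling average op duration (seconds)
    const double m_decay = 0.9;  // weight kept per update
public:
    // Record an op's duration and return how long to sleep afterwards.
    double update(double elapsed) {
        m_avg = m_avg * m_decay + elapsed * (1 - m_decay);
        return elapsed < m_avg ? m_avg - elapsed : 0.0;
    }
};
```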
Zygo Blaxell
f209cafcd8 bees: bump the file limits again, 512k files and 64k dirs
Test machines keep blowing past the 32k file limit.  16 worker
threads at 10,000 files each is much larger than 32k.

Other high-FD-count services like DNS servers ask for million-file
rlimits.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-16 22:54:12 -05:00
Zygo Blaxell
c4b31bdd5c extent scan: no need for "No ref for extent" debug message
While a snapshot is being deleted, there will be a continuous stream of
"No ref for extent" messages.  This is a common event that does not need
to be reported.

There is an analogous situation when a call to open() fails with ENOENT.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-14 15:02:39 -05:00
Zygo Blaxell
08fe145988 context: wait for btrfs send to finish, then try dedupe again
Dedupe is not possible on a subvol where a btrfs send is running:

    BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress)

btrfs informs a process with EAGAIN that a dedupe could not be performed
due to a running send operation.

It would be possible to save the crawler state at the affected point,
fork a new crawler that avoids the subvol under send, and resume the
crawler state after a successful dedupe is detected; however, this only
helps the intersection of the set of users who have unrelated subvols
that don't share extents, and the set of users who cannot simply delay
dedupe until send is finished.  The simplest approach is to stop and
wait until the send goes away.

The simplest approach is taken here.  When a dedupe fails with EAGAIN,
affected Tasks will poll, approximately once per transaction, until the
dedupe succeeds or fails with a different error.

bees dedupe performance corresponds with the availability of subvols that
can accept dedupe requests.  While the dedupe is paused, no new Tasks can
be performed by the worker thread.  If subvols are small and isolated
from the bulk of the filesystem data, the result will be a small but
partial loss of dedupe performance during the send as some worker threads
get stuck on the sending subvol.  If subvols heavily share extents with
duplicate data in other subvols, worker threads will all become blocked,
and the entire bees process will pause until at least some of the running
sends terminate.

During the polling for btrfs send, the dedupe Task will hold its dst
file open.  This open FD won't interfere with snapshot or file delete
because send subvols are always read-only (it is not possible to delete
a file on a RO subvol, open or otherwise) and send itself holds the
affected subvol open, preventing its deletion.  Once the send terminates,
the dedupe will terminate soon after, and the normal FD release can occur.

This pausing during btrfs send is unrelated to the
`--workaround-btrfs-send` option, although `--workaround-btrfs-send` will
cause the pausing to trigger less often.  It applies to all scan modes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-14 14:51:28 -05:00
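The polling loop described above reduces to a small sketch. `wait_for_new_transid` stands in for bees's transid polling and is an assumption, not a real API:

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Retry a dedupe that fails with EAGAIN (send in progress), roughly
// once per observed transaction, until it succeeds or fails with a
// different error.
int dedupe_with_send_wait(std::function<int()> try_dedupe,
                          std::function<void()> wait_for_new_transid)
{
    for (;;) {
        int rv = try_dedupe();
        if (rv == 0 || errno != EAGAIN)
            return rv;              // success, or a non-send failure
        wait_for_new_transid();     // poll ~once per transaction
    }
}
```

The worker thread holds its dst FD across the loop, which is safe because the subvol under send is read-only for the duration.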
Zygo Blaxell
bb09b1ab0e roots: drop method transid_re
There are no callers of this method any more, and it exposes more
of BeesRoots than we really want things to have access to.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-13 23:19:43 -05:00
Zygo Blaxell
94d9945d04 roots: move the transid cache update into transid_max_nocache()
All callers of the `transid_max_nocache` method update `m_transid_re`
with the return value, so do that in `transid_max_nocache` itself.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-13 23:19:43 -05:00
Zygo Blaxell
a02588b16f time: add more methods to support dynamic rate throttling
* Allow RateLimiter to change rate after construction.
* Check range of rate argument in constructor.
* Atomic increment for RateEstimator.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
21cedfb13e bytevector: rename the argument to operator[] to be more descriptive
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
b9abcceacb progress: move the "finished" tag to a column where it won't obscure data
The "done" pointer and the "%done" fields are still useful because they
indicate _actual_ progress, not the work that has been _promised_.
So it is possible for a crawl to be "finished" (all extents queued)
but not "100.0000%" (some of those extents still active or in the queue).

"deferred" state isn't particularly useful, so drop it.

"finished" state implies no ETA, so that column is unused.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
31f3a8d67d progress: relabel the inaccurate ETA column
ETA is calculated using a sample obtained by snooping on bees's normal
crawling operations.

This sample is heavily biased and not representative of the entire
filesystem.  If the distribution of extent sizes in the filesystem is
not uniform, the ETA can be wildly wrong.

Collecting an accurate sample set would require extra IO and CPU time
which should be spent doing dedupes instead.

Explicitly label the ETA as inaccurate to avoid having too many users
report the same bug.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
9beb602b16 task: ignore paused status while calculating dynamic thread count
bees might be unpaused at any time, so make sure that the dynamic load
calculation is ready with a non-zero thread count.

This avoids a delay of up to 5 seconds when responding to SIGUSR2
when loadavg tracking is enabled.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
0580c10082 main: add support for pause (SIGUSR1) and resume (SIGUSR2)
These are simple on/off switches for the task queue.  They are lightweight
requests for bees to be paused temporarily, but allow bees to release
open files and save progress while paused.

These signals are an alternative to SIGSTOP and SIGCONT, or using the
cgroup freezer's FROZEN and THAWED states, which pause and resume the
bees process, but do not allow the bees process to release open files
or save progress.  Snapshot and file deletes can occur on the filesystem
while bees is paused by SIGUSR1 but not by SIGSTOP.

These signals are also an alternative to SIGTERM and restart, which
flush out the whole hash table and progress state on exit, and read
the whole table back into memory on restart.

This feature is experimental and may be replaced by a more general
configuration or runtime control mechanism in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:01:19 -05:00
Zygo Blaxell
1cbc894e6f task: start up more worker threads when unpausing
When paused, TaskConsumer threads will eventually notice the paused
condition and exit; however, there's nothing to restart threads when
exiting the paused state.

When unpausing, and while the lock is already held, create TaskConsumer
threads as needed to reach the target thread count.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 22:53:00 -05:00
Zygo Blaxell
d74862f1fc fs: set the correct nr_items to 0 in the ENOENT search case
Commit 72c3bf8438 ("fs: handle ENOENT
within lib") was meant to prevent exceptions when a subvol is deleted.

If the search ioctl fails, the kernel won't set nr_items in the
ioctl output, which means `nr_items` still has the input value.  When
ENOENT is detected, `this->nr_items` is set to 0, then later `*this =
ioctl_ptr->key` overwrites `this->nr_items` with the original requested
number of items.

This replaced the ENOENT exception with an exception triggered by
interpreting garbage in the memory buffer.  The number of exceptions
was reduced because the memory buffers are frequently reused, but upper
layers would then reject the data or ignore it because it didn't match
the key range.

Fix by setting `ioctl_ptr->key.nr_items`, which then overwrites
`this->nr_items`, so the loop that extracts items from the ioctl data
gets the right number of items (i.e. zero).

Fixes: 72c3bf8438 ("fs: handle ENOENT within lib")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 22:48:15 -05:00
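The copy-back ordering can be reduced to a tiny sketch (field names loosely follow the btrfs search ioctl structs; this illustrates the fix, not bees's code):

```cpp
#include <cassert>

// The result count lives inside the ioctl buffer's key, and the
// object's state is later overwritten wholesale from that key.
// Zeroing only the object's copy would be undone by the copy-back;
// the buffer's copy is the one that must be zeroed.
struct search_key  { unsigned nr_items; };
struct search_args { search_key key; };

void handle_enoent(search_key &state, search_args &ioctl_buf)
{
    ioctl_buf.key.nr_items = 0;  // the fix: zero the buffer's count...
    state = ioctl_buf.key;       // ...so the copy-back preserves zero
}
```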
Zygo Blaxell
e40339856f readahead: use the right parameter order when checking the range
In some cases the offset and size arguments were flipped when checking to
see if a range had already been read.  This would have been OK as long as
the same mistake had been made consistently, since `bees_readahead_check`
only does a cache lookup on the parameters, it doesn't try to use them to
read a file.  Alas, there was one case where the correct order was used,
albeit a relatively rare one.

Fix all the calls to use the correct order.

Also fix a comment:  the recent request cache is global to all threads.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-04 11:17:44 -05:00
Zygo Blaxell
1dd96f20c6 fs: drop extra declaration of hexdump
hexdump was moved into a template in its own header years ago, but
the declaration of the implementation that used to be in fs.cc remains.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-04 11:17:44 -05:00
Zygo Blaxell
cd7a71aba3 hexdump: be a little more lock-friendly
hexdump processes a vector as a contiguous sequence of bytes, regardless
of V's value type, so hexdump should get a pointer and use uint8_t to
read the data.

Some vector types have a lock and some atomics in their operator[], so
let's avoid hammering those.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-03 23:39:33 -05:00
Zygo Blaxell
e99a505b3b bytevector: don't deadlock on operator<<
operator<< was a friend function that locked the ByteVector, then invoked
hexdump on the bytevector, which used ByteVector::operator[]...which
locked the ByteVector, resulting in a deadlock.

operator<< shouldn't be a friend anyway.  Make hexdump use the
normal public access methods for ByteVector.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-03 23:39:33 -05:00
Zygo Blaxell
3e89fe34ed roots: avoid copying a BtrfsIoctlSearchKey
Although all the members of BtrfsExtentDataFetcher are theoretically
copiable, there's no need to actually make any such copy.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-03 16:54:14 -05:00
Zygo Blaxell
dc74766179 context: spell "progress" correctly
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-02 09:50:28 -05:00
Zygo Blaxell
3a33a5386b context: add a PROGRESS: header in $BEESSTATUS
Make it clearer where the progress information goes.

Also add placeholder text so the progress section isn't empty at startup,
when the progress hasn't been calculated yet.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 11:41:59 -05:00
Zygo Blaxell
69e9bdfb0f docs: post-5.7 toxic extent handling
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:52 -05:00
Zygo Blaxell
7a197e2f33 bees: post-kernel-5.7 toxic extent handling
Toxic extents are mostly gone in kernel 5.7 and later.  Increase the
timeout for toxic extent handling to reduce false positives, and remove
persistently stored toxic hashes from the hash table.

Toxic hashes are still stored nonpersistently to help mitigate problems
due to any remaining kernel bugs.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:52 -05:00
Zygo Blaxell
43d38ca536 extent scan: don't serialize dedupe and LOGICAL_INO when using extent scan mode
The serialization doesn't seem to be necessary for the extent scan mode.
No infinite loops in the kernel have been observed in the past two years,
despite never having used MultiLock for the extent scanner.

Leave the serialization for now on the subvol scanners.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:52 -05:00
Zygo Blaxell
7b0ed6a411 docs: default scan mode is 4, "extent"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
8d4d153d1d main: set default scan mode to mode 4 (EXTENT)
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
d5a6c30623 docs: old missing features are not missing any more
The extent scan mode has been implemented (partially, but close enough
to win benchmarks).

New features include several nuisance dedupe countermeasures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
25f7ced27b docs: add scan mode 4, "extent"
Extent is a different kind of scan mode, so introduce the concept of
the two kinds of scan mode, and rearrange the description of scan modes
along the new boundaries.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
c1af219246 progress: squeeze the progress table into 80 columns or less
We don't need the subvol numbers since they're only interesting to
developers.

We don't need both max and min sizes, pick one and drop the other.

Replace "16E" with "max"--it is the same number of characters, but
doesn't require the user to know what 1<<64 is off the top of their head.

Shorten "remain" to "todo" because sometimes those extra two columns
matter.

Drop the seconds field in ETA timestamps.  Long scan arrival times are
years away, and short scan arrival times are only updated once every
5 minutes, so the extra precision isn't useful.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
9c183c2c22 progress: put the progress table in the stats and status files
Make the progress information more accessible, without having to
enable full debug log and fish it out of the stream with grep.

Also increase the progress log level to INFO.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
59f8a467c3 extent scan: fix crawl_map creation
There are two crawl_maps in extent scan's next_transid:  one gets
initialized, the other gets used.  This works OK as long as bees is
resuming an existing scan, because the two maps are identical; however,
it fails if bees is starting without an existing set of crawl data,
and one of the two maps is empty or partially filled.

The failure is intermittent, as the crawl map is being populated at
the same time next_transid runs.  It will eventually be completed after
several transaction cycles, at which point bees runs normally.
However, it adds significant delays during startup in benchmarks.

There's only one crawl_map in extent scan; it always has the same
crawlers, and extent scan's `next_transid` creates it by itself.
Ignore the map from BeesRoots/BeesCrawl.

Also throw in some missing but helpful trace statements.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
9987aa8583 progress: estimate actual data sizes for progress report
Replace pointers in the "done" and "total" columns with estimated data
sizes for each size tier.  The estimation is based on statistics
collected from extents scanned during the current bees run.

Move the total size for the entire filesystem up to the heading.

Report the _completed_ position (i.e. the one that would be saved in
`beescrawl.dat`), not the _queued_ position (i.e. the one where the
next Task would be created in memory).

At the end of the data, the crawl pointer ends up at some random point
in the filesystem just after the newest extent, so the progress gets to
99.7% and then goes to some random value like 47% or 3%, not to 100%.
Report "deferred" in the "done" column when the crawler is waiting for
the next transid, and "finished" in the "%done" column when the crawler
has reached the end of the data.  Suppress the ETA when finished.  This
makes it clear that there's no further work to do for these crawlers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
da32667e02 docs: add event counters for extent scan
Add a section for all the new extent scan event counters.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
8080abac97 extent scan: refactor BeesScanMode so derived classes decide their own scan scheduling
BeesScanModeExtent uses six scan Tasks instead of one, which leads
to awkwardness like the do_scan method telling crawl_roots how to do
what it shouldn't need to know how to do anyway.

Move the crawl_roots logic into the ::scan methods themselves.

This also deletes the very popular "crawl_more ran out of data" message.
Extent scan explicitly indicates when a scan is complete, so there's
no longer a need to fish this message out of the log.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
1e139d0ccc extent scan: put all the refs in a single Task, sort them, use idle task
The sorting avoids problematic read orders, like extent refs in the same
inode with descending offsets, that btrfs is not optimized for.

Putting everything in one Task keeps the queue sizes small, and
manages the lock contention much more calmly.

We only want to be mapping extent refs if there's not enough extents
already in the queue to keep worker threads busy, so use the `idle()`
method instead of `run()`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
6542917ffa extent scan: introduce SCAN_MODE_EXTENT
The EXTENT scan mode reads the extent tree, splits it into tiers by
extent size, converts each tier's extents into subvol/inode/offset refs,
then runs the legacy bees dedupe engine on the refs.

The extent scan mode can cheaply compute completion percentage and ETA,
so do that every time a new transid is observed.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
b99d80b40f task: add an idle queue
Add a second level queue which is only serviced when the local and global
queues are empty.

At some point there might be a need to implement a full priority queue,
but for now two classes are sufficient.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
099ad2ce7c fs: add some performance metrics for TREE_SEARCH_V2 calls
These give some visibility into how efficiently bees is using the
TREE_SEARCH_V2 ioctl.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
a59a02174f table: add a simple text table renderer
This should help clean up some of the uglier status outputs.

Supports:

 * multi-line table cells
 * character fills
 * sparse tables
 * insert, delete by row and column
 * vertical separators

and not much else.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
e22653e2c6 docs: remove "matched_" prefix event counters
We can no longer reliably determine the number of hash table matches,
since we'll stop counting after the first one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
44810d6df8 scan_one_extent: remove the unreadahead after benchmark results
That unreadahead used to result in a 10% hit on benchmarks.  Now it's
closer to 75%.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
8f92b1dacc BeesRangePair: drop the _really_ expensive toxic extent workaround
We were doing a `LOGICAL_INO` ioctl on every _block_ of a matching extent,
just to see how long it takes.  It takes a while!

This could be modified to do an ioctl with the `IGNORE_OFFSET` flag,
once per new extent, but the kernel bug was fixed a long time ago, so
we can start removing all the toxic extent code.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
0b974b5485 scan_one_extent: in skip/scan lines, log whether extent is compressed
Useful for debugging the compressed-zero-block cases.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
ce0367dafe scan_one_extent: reduce the number of LOGICAL_INO calls before finding a duplicate block range
When we have multiple possible matches for a block, we proceed in three
phases:

1.  retrieve each match's extent refs and put them in a list,
2.  iterate over the list converting viable block matches into range matches,
3.  sort and flatten the list of range matches into a non-overlapping
list of ranges that cover all duplicate blocks exactly once.

The separation of phase 1 and 2 creates a performance issue when there
are many block matches in phase 1, and all the range matches in phase
2 are the same length.  Even though we might quickly find the longest
possible matching range early in phase 2, we first extract all of the
extent refs from every possible matching block in phase 1, even though
most of those refs will never be used.

Fix this by moving the extent ref retrieval in phase 1 into a single
loop in phase 2, and stop looping over matching blocks as soon as any
dedupe range is created.  This avoids iterating over a large list of
blocks with expensive `LOGICAL_INO` ioctls in an attempt to improve the
match when there is no hope of improvement, e.g. when all match ranges
are 4K and the content is extremely prevalent in the data.

If we find a matched block that is part of a short matching range,
we can replace it with a block that is part of a long matching range,
because there is a good chance we will find a matching hash block in
the long range by looking up hashes after the end of the short range.
In that case, overlapping dedupe ranges covering both blocks in the
target extent will be inserted into the dedupe list, and the longest
matches will be selected at phase 3.  This usually provides a similar
result to that of the loop in phase 1, but _much_ more efficiently.

Some operations are left in phase 1, but they are all using internal
functions, not ioctls.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
54ed6e1cff docs: event counter updates after fixing counter names and scan_one_extent improvements
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
24b08ef7b7 scan_one_extent: eliminate nuisance dedupes, drop caches after reading data
A laundry list of problems fixed:

 * Track which physical blocks have been read recently without making
 any changes, and don't read them again.

 * Separate dedupe, split, and hole-punching operations into distinct
 planning and execution phases.

 * Keep the longest dedupe from overlapping dedupe matches, and flatten
 them into non-overlapping operations.

 * Don't scan extents that have blocks already in the hash table.
 We can't (yet) touch such an extent without making unreachable space.
 Let them go.

 * Give better information in the scan summary visualization:  show dedupe
 range start and end points (<ddd>), matching blocks (=), copy blocks
 (+), zero blocks (0), inserted blocks (.), unresolved match blocks
 (M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i),
 and there's-a-bug-we-didn't-do-this-one blocks (#).

 * Drop cached data from extents that have been inserted into the hash
 table without modification.

 * Rewrite the hole punching for uncompressed extents, which apparently
 hasn't worked properly since the beginning.

Nuisance dedupe elimination:

 * Don't do more than 100 dedupe, copy, or hole-punch operations per
 extent ref.

 * Don't split an extent or punch a hole unless dedupe would save at
 least half of the extent ref's size.

 * Write a "skip:" summary showing the planned work when nuisance
 dedupe elimination decides to skip an extent.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
97eab9655c types: add shrink_begin and shrink_end methods for BeesFileRange and BeesRangePair
These allow trimming of overlapping dedupes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
05bf1ebf76 counters: fix counter names for scan_eof, scan_no_fd, scanf_deferred_inode
This code gets moved around from time to time and ends up with the
wrong prefix.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
606ac01d56 multilock: allow turning it off
Add a master switch to turn off the entire MultiLock infrastructure for
testing, without having to remove and add all the individual entry points.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
72c3bf8438 fs: handle ENOENT within lib
This prevents the storms of exceptions that occur when a subvol is
deleted.  We simply treat the entire tree as if it was empty.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
72958a5e47 btrfs-tree: accessors for TreeFetcher classes' type and tree values
Sometimes we have a generic TreeFetcher and we need to know which tree
it came from.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
f25b4c81ba btrfs-tree: add root refs and extent flags fields
Lazily filling in accessor methods for btrfs objects as needed by bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
a64603568b task: fix try_lock argument description
try_lock allows specification of a different Task to be run instead of
the current Task when the lock is busy.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
33cde5de97 bees: increase file cache size limits
With some extents having 9999 refs, we can use much larger caches for
file descriptors.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
5414c7344f docs: resolve_overflow limit is only 655050 when BTRFS_MAX_EXTENT_REF_COUNT is
Use the current header value in the doc.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
8bac00433d bees: reduce extent ref limit to 9999
Originally the limit was 2730 (64KiB worth of ref pointers).  This limit
was a little too low for some common workloads, so it was then raised by
a factor of 256 to 699050, but there are a lot of problems with extent
counts that large.  Most of those problems are memory usage and speed
problems, but some of them trigger subtle kernel MM issues.

699050 references is too many to be practical.  Set the limit to 9999,
only 3-4x larger than the original 2730, to give up on deduplication
when each deduped ref reduces the amount of space by no more than 0.01%.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
088cbc951a docs: event counter updates after readahead sanity improvements
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
e78e05e212 readahead: inject more sanity at the foundation of an insane architecture
This solves a third bad problem with bees reads:

3.  The architecture above the read operations will issue read requests
for the same physical blocks over and over in a short period of time.

Fixing that properly requires rewriting the upper-level code, but a
simple small table of recent read requests can reduce the effect of the
problem by orders of magnitude.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
8d08a3c06f readahead: inject some sanity at the foundation of an insane architecture
This solves some of the worst problems with bees reads:

1.  The kernel readahead doesn't work.  More precisely, it's much better
adapted for a very different use case:  a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead.  The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads get
preempted by the random reads, or cancels the readaheads because the
data access pattern isn't sequential after the readahead was issued.

2.  Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces.  At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile.  In all other cases,
the elaborate calculations always return the same result:  there can be
only one reader at a time.

This commit fixes both problems:

1.  Don't use the kernel readahead.  Use normal reads into a dummy
buffer instead.

2.  Allow only one thread to readahead at any time.  Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
cdcdf8e218 hash: use kernel readahead instead of bees_readahead to prefetch hash table
The hash table is read sequentially and from a single thread, so
the kernel's implementation of readahead is appropriate here.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
37f5b1bfa8 docs: add allocator regression in 6.0+ kernels
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
abe2afaeb2 context: when a task fails to acquire an extent lock, don't go ahead and scan the extent anyway
Commit c3b664fea5 ("context: don't forget
to retry locked extents") removed the critical return that prevents a
Task from processing an extent that is locked.

Put the return back.

Fixes: c3b664fea5 ("context: don't forget to retry locked extents")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
792fdbbb13 fs: get rid of 16 MiB limit on dedupe requests
The kernel has not required a 16 MiB limit on dedupe requests since
v4.18-rc1 b67287682688 ("Btrfs: dedupe_file_range ioctl: remove 16MiB
restriction").

Kernels before v4.18 would truncate the request and return the size
actually deduped in `bytes_deduped`.  Kernel v4.18 and later will loop
in the kernel until the entire request is satisfied (although still
in 16 MiB chunks, so larger extents will be split).

Modify the loop in userspace to measure the size the kernel actually
deduped, instead of assuming the kernel will only accept 16 MiB.
On current kernels this will always loop exactly once.

Since we now rely on `bytes_deduped`, make sure it has a sane value.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
30a4fb52cb Revert "context: add experimental code for avoiding tiny extents"
because this problem is better solved elsewhere.

This reverts commit 11fabd66a8.
2024-11-30 23:30:33 -05:00
Zygo Blaxell
90d7075358 usage: the default scan mode is 3 (recent)
The code and docs were changed some time ago, but not the usage message.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
faac895568 docs: add the 6.10..6.12 delayed refs bug
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
a7baa565e4 crawl: rename next_transid() to avoid confusion with BeesScanMode::next_transid()
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
b408eac98e trace: add file and line numbers all the way up the stack
These were added to crucible all the way back in 2018 (1beb61fb78
"crucible: error: record location of exception in what() message")
but it's even more useful in the stack tracer in bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:27:24 -05:00
Zygo Blaxell
75131f396f context: reduce the size of LOGICAL_INO buffers
Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent
references by definition, it follows that we should not allocate buffer
space for them when we perform the LOGICAL_INO ioctl.

There is some evidence (particularly
https://github.com/Zygo/bees/issues/260#issuecomment-1627598058) that
the kernel is subjecting the page cache to a lot of disruption when
trying allocate large buffers for LOGICAL_INO.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-04-17 23:14:35 -04:00
Zygo Blaxell
cfb7592859 usage: the default scan mode is 1 (independent)
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-04-17 23:14:35 -04:00
Zygo Blaxell
3839690ba3 lib: fix btrfs_data_container pointer casts for 32-bit userspace on 64-bit kernels
Apparently reinterpret_cast<uint64_t> sign-extends 32-bit pointers.
This is OK when running on a 32-bit kernel that will truncate the pointer
to 32 bits, but when running on a 64-bit kernel, the extra bits are
interpreted as part of the (now very invalid) address.

Use <uintptr_t> instead, which is unsigned, integer, and the same word
size as the arch's pointer type.  Ordinary numeric conversion can take
it from there, filling the rest of the word with zeros.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-04-17 23:07:41 -04:00
50 changed files with 3758 additions and 1219 deletions

@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
About bees
----------
bees is a block-oriented userspace deduplication agent designed for large
btrfs filesystems. It is an offline dedupe combined with an incremental
data scan capability to minimize time data spends on disk from write
to dedupe.
bees is a block-oriented userspace deduplication agent designed to scale
up to large btrfs filesystems. It is an offline dedupe combined with
an incremental data scan capability to minimize time data spends on disk
from write to dedupe.
Strengths
---------
* Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon incrementally dedupes new data using btrfs tree search
* Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon mode - incrementally dedupes new data as it appears
* Largest extents first - recover more free space during fixed maintenance windows
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* Works around btrfs filesystem structure to free more disk space
* Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
* Persistent hash table for rapid restart after shutdown
* Whole-filesystem dedupe - including snapshots
* Constant hash table size - no increased RAM usage if data set becomes larger
* Works on live data - no scheduled downtime required
* Automatic self-throttling based on system load
* Automatic self-throttling - reduces system load
* btrfs support - recovers more free space from btrfs than naive dedupers
Weaknesses
----------
* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
* Requires root privilege (or `CAP_SYS_ADMIN`)
* First run may require temporary disk space for extent reorganization
* Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
* [First run may increase metadata space usage if many snapshots exist](docs/gotchas.md)
* Constant hash table size - no decreased RAM usage if data set becomes smaller
* btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
-------------------
* [bees Gotchas](docs/gotchas.md)
* [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING
* [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
* [bees vs. other btrfs features](docs/btrfs-other.md)
* [What to do when something goes wrong](docs/wrong.md)
@@ -69,6 +69,6 @@ You can also use Github:
Copyright & License
-------------------
Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).

@@ -1,31 +1,24 @@
Recommended Kernel Version for bees
===================================
Recommended Linux Kernel Version for bees
=========================================
First, a warning that is not specific to bees:
First, a warning about old Linux kernel versions:
> **Kernel 5.1, 5.2, and 5.3 should not be used with btrfs due to a
severe regression that can lead to fatal metadata corruption.**
This issue is fixed in kernel 5.4.14 and later.
> **Linux kernel version 5.1, 5.2, and 5.3 should not be used with btrfs
due to a severe regression that can lead to fatal metadata corruption.**
This issue is fixed in version 5.4.14 and later.
**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
6.0, or 6.1, with recent LTS and -stable updates.** The latest released
kernel as of this writing is 6.4.1.
**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
6.6, or 6.12 with recent LTS and -stable updates.** The latest released
kernel as of this writing is 6.12.9, and the earliest supported LTS
kernel is 5.4.
4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
issues. Older kernels will be slower (a little slower or a lot slower
depending on which issues are triggered). Not all fixes are backported.
Obsolete non-LTS kernels have a variety of unfixed issues and should
not be used with btrfs. For details see the table below.
bees requires btrfs kernel API version 4.2 or higher, and does not work
at all on older kernels.
Some bees features rely on kernel 4.15 to work, and these features will
not be available on older kernels. Currently, bees is still usable on
older kernels with degraded performance or with options disabled, but
support for older kernels may be removed.
Some optional bees features use kernel APIs introduced in kernel 4.15
(extent scan) and 5.6 (`openat2` support). These bees features are not
available on older kernels. Support for older kernels may be removed
in a future bees release.
bees will not run at all on kernels before 4.2 due to lack of minimal
API support.
@@ -62,14 +55,17 @@ These bugs are particularly popular among bees users, though not all are specifi
| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
| - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
| - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
| 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount. Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
| 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
| - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
| - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
| 5.18 | 5.19 | parent transid verify failed during log tree replay after a crash during a rename operation | 5.18.18, 5.19.2, 6.0 and later | 723df2bcc9e1 btrfs: join running log transaction when logging new name
| 5.12 | 6.0 | space cache corruption and potential double allocations | 5.15.65, 5.19.6, 6.0 and later | ced8ecf026fd btrfs: fix space cache corruption and potential double allocations
| 6.0 | 6.5 | suboptimal allocation in multi-device filesystems due to chunk allocator regression | 6.1.60, 6.5.9, 6.6 and later | 8a540e990d7d btrfs: fix stripe length calculation for non-zoned data chunk allocation
| 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later. Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
| 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that
| 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that
"Last bad kernel" refers to that version's last stable update from
kernel.org. Distro kernels may backport additional fixes. Consult
@@ -95,12 +91,12 @@ contains the last committed component of the fix.
Workarounds for known kernel bugs
---------------------------------
* **Hangs with concurrent `LOGICAL_INO` and dedupe**: on all
kernel versions so far, multiple threads running `LOGICAL_INO`
and dedupe ioctls at the same time on the same inodes or extents
* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**: on all
kernel versions so far, multiple threads running `LOGICAL_INO` and
dedupe/clone ioctls at the same time on the same inodes or extents
can lead to a kernel hang. The kernel enters an infinite loop in
`add_all_parents`, where `count` is 0, `ref->count` is 1, and
`btrfs_next_item` or `btrfs_next_old_item` never find a matching ref).
`btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.
bees has two workarounds for this bug: 1. schedule work so that multiple
threads do not simultaneously access the same inode or the same extent,
@@ -121,58 +117,32 @@ Workarounds for known kernel bugs
It is still theoretically possible to trigger the kernel bug when
running bees at the same time as other dedupers, or other programs
that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
operation such as `cp` or `mv`; however, it's extremely difficult to
reproduce the bug without closely cooperating threads.
* **Slow backrefs** (aka toxic extents): On older kernels, under certain
conditions, if the number of references to a single shared extent grows
too high, the kernel consumes more and more CPU while also holding
locks that delay write access to the filesystem. This is no longer
a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
but there are still some remains of earlier workarounds for this issue
in bees that have not been fully removed.
bees avoided this bug by measuring the time the kernel spends performing
`LOGICAL_INO` operations and permanently blacklisting any extent or
hash involved where the kernel starts to get slow. In the bees log,
such blocks are labelled as 'toxic' hash/block addresses.
Future bees releases will remove toxic extent detection (it only detects
false positives now) and clear all previously saved toxic extent bits.
* **dedupe breaks `btrfs send` in old kernels**. The bees option
`--workaround-btrfs-send` prevents any modification of read-only subvols
in order to avoid breaking `btrfs send` on kernels before 5.2.
This workaround is no longer necessary to avoid kernel crashes and
send performance failure on kernel 5.4.4 and later. bees will pause
dedupe until the send is finished on current kernels.
`btrfs receive` is not and has never been affected by this issue.


Good Btrfs Feature Interactions
-------------------------------
bees has been tested in combination with the following:
* btrfs compression (zlib, lzo, zstd)
* PREALLOC extents (unconditionally replaced with holes)
* HOLE extents and btrfs no-holes feature
* Other deduplicators (`duperemove`, `jdupes`)
* Reflink copies (modern coreutils `cp` and `mv`)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
* All btrfs RAID profiles: single, dup, raid0, raid1, raid10, raid1c3, raid1c4, raid5, raid6
* IO errors during dedupe (affected extents are skipped)
* 4K filesystem data block size / clone alignment
* 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
* Large files (kernel 5.4 or later strongly recommended)
* Filesystem data sizes up to 100T+ bytes, 1000M+ files
* `open(O_DIRECT)` (seems to work as well--or as poorly--with bees as with any other btrfs feature)
* btrfs-convert from ext2/3/4
* btrfs `autodefrag` mount option
* btrfs balance (data balances cause rescan of relocated data)
* btrfs block-group-tree
* btrfs `flushoncommit` and `noflushoncommit` mount options
* btrfs mixed block groups
* btrfs `nodatacow`/`nodatasum` inode attribute or mount option (bees skips all nodatasum files)
* btrfs qgroups and quota support (_not_ squotas)
* btrfs receive
* lvm dm-cache, writecache
* btrfs scrub
* btrfs send (dedupe pauses automatically, kernel 5.4 or later required)
* btrfs snapshot, non-snapshot subvols (RW and RO), snapshot delete
**Note:** some btrfs features have minimum kernel versions which are
higher than the minimum kernel version for bees.
Untested Btrfs Feature Interactions
-----------------------------------
bees has not been tested with the following, and undesirable interactions may occur:
* Non-4K filesystem data block size (should work if recompiled)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
* btrfs seed filesystems, raid-stripe-tree, squotas (no particular reason these wouldn't work, but no one has reported trying)
* btrfs out-of-tree kernel patches (e.g. encryption, extent tree v2)
* Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
* bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper


Notes:
* If the hash table is too large, no extra dedupe efficiency is
obtained, and the extra space wastes RAM.
* If the hash table is too small, bees extrapolates from matching
blocks to find matching adjacent blocks in the filesystem that have been
patterns on dedupe effectiveness without performing deep inspection of
both the filesystem data and its structure--a task that is as expensive
as performing the deduplication.
* **Compression** in files reduces the average extent length compared
to uncompressed files. The maximum compressed extent length on
btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
Longer extents decrease the optimum hash table size while shorter extents
increase the optimum hash table size, because the probability of a hash
table entry being present (i.e. unevicted) in each extent is proportional
to the extent length.
As a rule of thumb, the optimal hash table size for a compressed
filesystem is 2-4x larger than the optimal hash table size for the same
data on an uncompressed filesystem. Dedupe efficiency falls rapidly with
hash tables smaller than 128MB/TB as the average dedupe extent size is
larger than the largest possible compressed extent size (128KB).
* **Short writes or fragmentation** also shorten the average extent
length and increase optimum hash table size. If a database writes to
code files over and over, so it will need a smaller hash table than a
backup server which has to refer to the oldest data on the filesystem
every time a new client machine's data is added to the server.
Scanning modes
--------------
The `--scan-mode` option affects how bees iterates over the filesystem,
schedules extents for scanning, and tracks progress.
There are now two kinds of scan mode: the legacy **subvol** scan modes,
and the new **extent** scan mode.
Scan mode can be changed by restarting bees with a different scan mode
option.
Extent scan mode:
* Works with 4.15 and later kernels.
* Can estimate progress and provide an ETA.
* Can optimize scanning order to dedupe large extents first.
* Can keep up with frequent creation and deletion of snapshots.
Subvol scan modes:
* Work with 4.14 and earlier kernels.
* Cannot estimate or report progress.
* Cannot optimize scanning order by extent size.
* Have problems keeping up with multiple snapshots created during a scan.
The default scan mode is 4, "extent".
If you are using bees for the first time on a filesystem with many
existing snapshots, you should read about [snapshot gotchas](gotchas.md).
Subvol scan modes
-----------------
Subvol scan modes are maintained for compatibility with existing
installations, but will not be developed further. New installations
should use extent scan mode instead.
The _quantity_ of text below detailing the shortcomings of each subvol
scan mode should be informative all by itself.
Subvol scan modes work on any kernel version supported by bees. They
are the only scan modes usable on kernel 4.14 and earlier.
The difference between the subvol scan modes is the order in which the
files from different subvols are fed into the scanner. They all scan
files in inode number order, from low to high offset within each inode,
the same way that a program like `cat` would read files (but skipping
over old data from earlier btrfs transactions).
If a filesystem has only one subvolume with data in it, then all of
the subvol scan modes are equivalent. In this case, there is only one
subvolume to scan, so every possible ordering of subvols is the same.
The `--workaround-btrfs-send` option pauses scanning subvols that are
read-only. If the subvol is made read-write (e.g. with `btrfs prop set
$subvol ro false`), or if the `--workaround-btrfs-send` option is removed,
then the scan of that subvol is unpaused and dedupe proceeds normally.
Space will only be recovered when the last read-only subvol is deleted.
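For example, to check and clear the read-only flag on a subvol so that its scan unpauses (a sketch; `$subvol` is a placeholder for the subvol path):

```shell
btrfs property get "$subvol" ro         # prints ro=true or ro=false
btrfs property set "$subvol" ro false   # bees unpauses its scan of this subvol
```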
Subvol scan modes cannot efficiently or accurately calculate an ETA for
completion or estimate progress through the data. They simply request
"the next new inode" from btrfs, and they are completed when btrfs says
there is no next new inode.
Between subvols, there are several scheduling algorithms with different
trade-offs:
Scan mode 0, "lockstep", scans the same inode number in each subvol at
close to the same time. This is useful if the subvols are snapshots
with a common ancestor, since the same inode number in each subvol will
have similar or identical contents. This maximizes the likelihood that
all of the references to a snapshot of a file are scanned at close to
the same time, improving dedupe hit rate. If the subvols are unrelated
(i.e. not snapshots of a single subvol) then this mode does not provide
any significant advantage. This mode uses smaller amounts of temporary
space for shorter periods of time when most subvols are snapshots. When a
new snapshot is created, this mode will stop scanning other subvols and
scan the new snapshot until the same inode number is reached in each
subvol, which will effectively stop dedupe temporarily as this data has
already been scanned and deduped in the other snapshots.
Scan mode 1, "independent", scans the next inode with new data in
each subvol. There is no coordination between the subvols, other than
round-robin distribution of files from each subvol to each worker thread.
This mode makes continuous forward progress in all subvols. When a new
snapshot is created, previous subvol scans continue as before, but the
worker threads are now divided among one more subvol.
Scan mode 2, "sequential", scans one subvol at a time, in numerical subvol
ID order, processing each subvol completely before proceeding to the next
subvol. This avoids spending time scanning short-lived snapshots that
will be deleted before they can be fully deduped (e.g. those used for
`btrfs send`). Scanning starts on older subvols that are more likely
to be origin subvols for future snapshots, eliminating the need to
dedupe future snapshots separately. This mode uses the largest amount
of temporary space for the longest time, and typically requires a larger
hash table to maintain dedupe hit rate.
Scan mode 3, "recent", scans the subvols with the highest `min_transid`
value first (i.e. the ones that were most recently completely scanned),
then falls back to "independent" mode to break ties. This interrupts
long scans of old subvols to give a rapid dedupe response to new data
in previously scanned subvols, then returns to the old subvols after
the new data is scanned.
Extent scan mode
----------------
Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
Extent scan mode reads each extent once, regardless of the number of
reflinks or snapshots. It adapts to the creation of new snapshots
and reflinks immediately, without having to revisit old data.
In the extent scan mode, extents are separated into multiple size tiers
to prioritize large extents over small ones. Deduping large extents
keeps the metadata update cost low per block saved, resulting in faster
dedupe at the start of a scan cycle. This is important for maximizing
performance in use cases where bees runs for a limited time, such as
during an overnight maintenance window.
Once the larger size tiers are completed, the rate of space recovery
from dedupe slows down significantly. It may be desirable to stop bees
once the larger size tiers are finished, then start it again some time
later after new data has appeared.
Each extent is mapped in physical address order, and all extent references
are submitted to the scanner at the same time, resulting in much better
cache behavior and dedupe performance compared to the subvol scan modes.
The "extent" scan mode is not usable on kernels before 4.15 because
it relies on the `LOGICAL_INO_V2` ioctl added in that kernel release.
When using bees with an older kernel, only subvol scan modes will work.
Extents are divided into virtual subvols by size, using reserved btrfs
subvol IDs 250..255. The size tier groups are:
* 250: 32M+1 and larger
* 251: 8M+1..32M
* 252: 2M+1..8M
* 253: 512K+1..2M
* 254: 128K+1..512K
* 255: 128K and smaller (includes all compressed extents)
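For illustration, the tier boundaries above can be written as a simple threshold function. This is a sketch only, not the actual bees implementation (which is internal C++ code):

```shell
# Map an extent length in bytes to the virtual subvol ID of its size
# tier, following the boundaries listed above (illustrative sketch).
tier_for_length() {
    len=$1
    if   [ "$len" -gt $((32 * 1024 * 1024)) ]; then echo 250
    elif [ "$len" -gt $((8 * 1024 * 1024)) ];  then echo 251
    elif [ "$len" -gt $((2 * 1024 * 1024)) ];  then echo 252
    elif [ "$len" -gt $((512 * 1024)) ];       then echo 253
    elif [ "$len" -gt $((128 * 1024)) ];       then echo 254
    else                                            echo 255
    fi
}

tier_for_length $((128 * 1024))       # prints 255: the largest compressed extent
tier_for_length $((4 * 1024 * 1024))  # prints 252
```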
Extent scan mode can efficiently calculate dedupe progress within
the filesystem and estimate an ETA for completion within each size
tier; however, the accuracy of the ETA can be questionable due to the
non-uniform distribution of block addresses in a typical user filesystem.
Older versions of bees do not recognize the virtual subvols, so running
an old bees version after running a new bees version will reset the
"extent" scan mode's progress in `beescrawl.dat` to the beginning.
This may change in future bees releases, i.e. extent scans will store
their checkpoint data somewhere else.
The `--workaround-btrfs-send` option behaves differently in extent
scan mode: dedupe proceeds on all subvols that are read-write, but all
subvols that are read-only are excluded from dedupe.
Space will only be recovered when the last read-only subvol is deleted.
During `btrfs send`, duplicate extents in the sent subvol will not be
removed (the kernel will reject dedupe commands while send is active,
and bees currently will not re-issue them after the send is complete).
It may be preferable to terminate the bees process while running `btrfs
send` in extent scan mode, and restart bees after the `send` is complete.
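One way to do that with the `beesd@.service` systemd unit shipped with bees (the UUID and paths here are placeholders):

```shell
systemctl stop beesd@deadbeef-dead-beef-dead-beefdeadbeef.service
btrfs send /pool/snapshots/latest | btrfs receive /backup
systemctl start beesd@deadbeef-dead-beef-dead-beefdeadbeef.service
```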
Threads and load management
---------------------------
By default, bees creates one worker thread for each CPU detected. These
threads then perform scanning and dedupe operations. bees attempts to
maximize the amount of productive work each thread does, until either the
threads are all continuously busy, or there is no remaining work to do.
In many cases it is not desirable to continually run bees at maximum
performance. Maximum performance is not necessary if bees can dedupe
new data faster than it appears on the filesystem. If it only takes
bees 10 minutes per day to dedupe all new data on a filesystem, then
bees doesn't need to run for more than 10 minutes per day.
bees supports a number of options for reducing system load:
* Run bees for a few hours per day, at an off-peak time (i.e. during
a maintenance window), instead of running bees continuously. Any data
added to the filesystem while bees is not running will be scanned when
bees restarts. At the end of the maintenance window, terminate the
bees process with SIGTERM to write the hash table and scan position
for the next maintenance window.
* Temporarily pause bees operation by sending the bees process SIGUSR1,
and resume operation with SIGUSR2. This is preferable to freezing
and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
signals, because it allows bees to close open file handles that would
otherwise prevent those files from being deleted while bees is frozen.
* Reduce the number of worker threads with the [`--thread-count` or
`--thread-factor` options](options.md). This simply leaves CPU cores
idle so that other applications on the host can use them, or to save
power.
* Allow bees to automatically track system load and increase or decrease
the number of threads to reach a target system load. This reduces
impact on the rest of the system by pausing bees when other CPU and IO
intensive loads are active on the system, and resumes bees when the other
loads are inactive. This is configured with the [`--loadavg-target`
and `--thread-min` options](options.md).
* Allow bees to self-throttle operations that enqueue delayed work
within btrfs. These operations are not well controlled by Linux
features such as process priority or IO priority or IO rate-limiting,
because the enqueued work is submitted to btrfs several seconds before
btrfs performs the work. By the time btrfs performs the work, it's too
late for external throttling to be effective. The [`--throttle-factor`
option](options.md) tracks how long it takes btrfs to complete queued
operations, and reduces bees's queued work submission rate to match
btrfs's queued work completion rate (or a fraction thereof, to reduce
system load).
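A hypothetical invocation combining some of these options (the option values and filesystem path are illustrative placeholders, not recommendations):

```shell
# Limit bees to 2 worker threads, shrinking toward 1 thread when the
# load average exceeds 5, and throttle queued btrfs work.
bees --thread-count 2 --loadavg-target 5 --thread-min 1 \
     --throttle-factor 1.0 /mnt/btrfs-root &
bees_pid=$!

kill -USR1 "$bees_pid"   # pause (bees also closes its open file handles)
kill -USR2 "$bees_pid"   # resume scanning and dedupe
```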
Log verbosity
-------------


The `crawl` event group consists of operations related to scanning btrfs trees.
* `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
* `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
* `crawl_create`: A new subvol crawler was created.
* `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
* `crawl_done`: One pass over a subvol was completed.
* `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
* `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
* `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
* `crawl_extent`: The extent crawler queued all references to an extent for processing.
* `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
* `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
* `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
* `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
* `crawl_hole`: An extent item in the search results refers to a hole.
* `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
* `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
* `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
* `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
* `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
* `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
* `crawl_unknown`: An extent item in the search results has an unrecognized type.
* `crawl_unthrottled`: Extent scan allowed to create work queue items again.
dedup
-----
The `exception` event group consists of C++ exceptions.
* `exception_caught`: Total number of C++ exceptions thrown and caught by a generic exception handler.
* `exception_caught_silent`: Total number of "silent" C++ exceptions thrown and caught by a generic exception handler. These are exceptions which are part of the correct and normal operation of bees. The exceptions are logged at a lower log level.
extent
------
The `extent` event group consists of events that occur within the extent scanner.
* `extent_deferred_inode`: A lock conflict was detected when two worker threads attempted to manipulate the same inode at the same time.
* `extent_empty`: A complete list of references to an extent was created but the list was empty, e.g. because all refs are in deleted inodes or snapshots.
* `extent_fail`: An ioctl call to `LOGICAL_INO` failed.
* `extent_forward`: An extent reference was submitted for scanning.
* `extent_mapped`: A complete map of references to an extent was created and added to the crawl queue.
* `extent_ok`: An ioctl call to `LOGICAL_INO` completed successfully.
* `extent_overflow`: A complete map of references to an extent exceeded `BEES_MAX_EXTENT_REF_COUNT`, so the extent was dropped.
* `extent_ref_missing`: An extent reference reported by `LOGICAL_INO` was not found by later `TREE_SEARCH_V2` calls.
* `extent_ref_ok`: One extent reference was queued for scanning.
* `extent_restart`: An extent reference was requeued to be scanned again after an active extent lock is released.
* `extent_retry`: An extent reference was requeued to be scanned again after an active inode lock is released.
* `extent_skip`: A 4K extent with more than 1000 refs was skipped.
* `extent_zero`: An ioctl call to `LOGICAL_INO` succeeded, but reported an empty list of extents.
hash
----
The `hash` event group consists of operations related to the bees hash table.
* `hash_insert`: A `(hash, address)` pair was inserted by `BeesHashTable::push_random_hash_addr`.
* `hash_lookup`: The hash table was searched for `(hash, address)` pairs matching a given `hash`.
* `matched_3_or_more`: A data block was scanned, hash table entries found, and three or more matching data blocks on the filesystem located.
open
----
The `pairforward` event group consists of events related to extending matching block ranges forward.
* `pairforward_try`: Started extending a pair of matching block ranges forward.
* `pairforward_zero`: A pair of matching block ranges could not be extended backward by one block because the src block contained all zeros and was not compressed.
progress
--------
The `progress` event group consists of events related to progress estimation.
* `progress_no_data_bg`: Failed to retrieve any data block groups from the filesystem.
* `progress_complete`: A crawler for one size tier has completed a scan.
* `progress_not_found`: The extent position for a crawler does not correspond to any block group.
* `progress_out_of_bg`: The extent position for a crawler does not correspond to any data block group.
* `progress_ok`: Table of progress and ETA created successfully.
readahead
---------
The `readahead` event group consists of events related to calls to `posix_fadvise`.
The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).
* `readahead_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_WILLNEED)` aka `readahead()`.
* `readahead_bytes`: Number of bytes prefetched.
* `readahead_count`: Number of read calls.
* `readahead_clear`: Number of times the duplicate read cache was cleared.
* `readahead_fail`: Number of read errors during prefetch.
* `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
* `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
* `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.
replacedst
@@ -301,7 +328,7 @@ The `resolve` event group consists of operations related to translating a btrfs
* `resolve_large`: The `LOGICAL_INO` ioctl returned more than 2730 results (the limit of the v1 ioctl).
* `resolve_ms`: Total time spent in the `LOGICAL_INO` ioctl (i.e. wallclock time, not kernel CPU time).
* `resolve_ok`: The `LOGICAL_INO` ioctl returned success.
* `resolve_overflow`: The `LOGICAL_INO` ioctl returned more than 655050 extents (the limit of the v2 ioctl).
* `resolve_overflow`: The `LOGICAL_INO` ioctl returned 9999 or more extents (the limit configured in `bees.h`).
* `resolve_toxic`: The `LOGICAL_INO` ioctl took more than 0.1 seconds of kernel CPU time.
root
@@ -329,35 +356,38 @@ The `scan` event group consists of operations related to scanning incoming data.
* `scan_blacklisted`: A blacklisted extent was passed to `scan_forward` and dropped.
* `scan_block`: A block of data was scanned.
* `scan_bump`: After deduping a block range, the scan pointer had to be moved past the end of the deduped byte range.
* `scan_dup_block`: Number of duplicate blocks deduped.
* `scan_dup_hit`: A pair of duplicate block ranges was found and removed.
* `scan_compressed_no_dedup`: An extent that was compressed contained non-zero, non-duplicate data.
* `scan_dup_block`: Number of duplicate block references deduped.
* `scan_dup_hit`: A pair of duplicate block ranges was found.
* `scan_dup_miss`: A pair of duplicate blocks was found in the hash table but not in the filesystem.
* `scan_eof`: Scan past EOF was attempted.
* `scan_erase_redundant`: Blocks in the hash table were removed because they were removed from the filesystem by dedupe.
* `scan_extent`: An extent was scanned (`scan_one_extent`).
* `scan_extent_tiny`: An extent below 128K that was not the beginning or end of a file was scanned. No action is currently taken for these--they are merely counted.
* `scan_forward`: A logical byte range was scanned (`scan_forward`).
* `scan_found`: An entry was found in the hash table matching a scanned block from the filesystem.
* `scan_hash_hit`: A block was found on the filesystem corresponding to a block found in the hash table.
* `scan_hash_miss`: A block was not found on the filesystem corresponding to a block found in the hash table.
* `scan_hash_preinsert`: A block was prepared for insertion into the hash table.
* `scan_hash_preinsert`: A non-zero data block's hash was prepared for possible insertion into the hash table.
* `scan_hash_insert`: A non-zero data block's hash was inserted into the hash table.
* `scan_hole`: A hole extent was found during scan and ignored.
* `scan_interesting`: An extent had flags that were not recognized by bees and was ignored.
* `scan_lookup`: A hash was looked up in the hash table.
* `scan_malign`: A block being scanned matched a hash at EOF in the hash table, but the EOF was not aligned to a block boundary and the two blocks did not have the same length.
* `scan_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
* `scan_no_rewrite`: All blocks in an extent were removed by dedupe (i.e. no copies).
* `scan_push_front`: An entry in the hash table matched a duplicate block, so the entry was moved to the head of its LRU list.
* `scan_reinsert`: A copied block's hash and block address was inserted into the hash table.
* `scan_resolve_hit`: A block address in the hash table was successfully resolved to an open FD and offset pair.
* `scan_resolve_zero`: A block address in the hash table was not resolved to any subvol/inode pair, so the corresponding hash table entry was removed.
* `scan_rewrite`: A range of bytes in a file was copied, then the copy deduped over the original data.
* `scan_root_dead`: A deleted subvol was detected.
* `scan_seen_clear`: The list of recently scanned extents reached maximum size and was cleared.
* `scan_seen_erase`: An extent reference was modified by scan, so all future references to the extent must be scanned.
* `scan_seen_hit`: A scan was skipped because the same extent had recently been scanned.
* `scan_seen_insert`: An extent reference was not modified by scan and its hashes have been inserted into the hash table, so all future references to the extent can be ignored.
* `scan_seen_miss`: A scan was not skipped because the same extent had not recently been scanned (i.e. the extent was scanned normally).
* `scan_skip_bytes`: Nuisance dedupe or hole-punching would save less than half of the data in an extent.
* `scan_skip_ops`: Nuisance dedupe or hole-punching would require too many dedupe/copy/hole-punch operations in an extent.
* `scan_toxic_hash`: A scanned block has the same hash as a hash table entry that is marked toxic.
* `scan_toxic_match`: A hash table entry points to a block that is discovered to be toxic.
* `scan_twice`: Two references to the same block have been found in the hash table.
* `scan_zero_compressed`: An extent that was compressed and contained only zero bytes was found.
* `scan_zero_uncompressed`: A block that contained only zero bytes was found in an uncompressed extent.
* `scan_zero`: A data block containing only zero bytes was detected.
scanf
-----
@@ -365,9 +395,10 @@ scanf
The `scanf` event group consists of operations related to `BeesContext::scan_forward`. This is the entry point where `crawl` schedules new data for scanning.
* `scanf_deferred_extent`: Two tasks attempted to scan the same extent at the same time, so one was deferred.
* `scanf_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
* `scanf_eof`: Scan past EOF was attempted.
* `scanf_extent`: A btrfs extent item was scanned.
* `scanf_extent_ms`: Total thread-seconds spent scanning btrfs extent items.
* `scanf_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
* `scanf_total`: A logical byte range of a file was scanned.
* `scanf_total_ms`: Total thread-seconds spent scanning logical byte ranges.


@@ -205,7 +205,7 @@ Other Gotchas
* bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
measuring the time required to perform `LOGICAL_INO` operations.
If an extent requires over 0.1 kernel CPU seconds to perform a
If an extent requires over 5.0 kernel CPU seconds to perform a
`LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
referencing it in future operations. In most cases, fewer than 0.1%
of extents in a filesystem must be avoided this way. This results


@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
About bees
----------
bees is a block-oriented userspace deduplication agent designed for large
btrfs filesystems. It is an offline dedupe combined with an incremental
data scan capability to minimize time data spends on disk from write
to dedupe.
bees is a block-oriented userspace deduplication agent designed to scale
up to large btrfs filesystems. It is an offline dedupe combined with
an incremental data scan capability to minimize time data spends on disk
from write to dedupe.
Strengths
---------
* Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon incrementally dedupes new data using btrfs tree search
* Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon mode - incrementally dedupes new data as it appears
* Largest extents first - recover more free space during fixed maintenance windows
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* Works around btrfs filesystem structure to free more disk space
* Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
* Persistent hash table for rapid restart after shutdown
* Whole-filesystem dedupe - including snapshots
* Constant hash table size - no increased RAM usage if data set becomes larger
* Works on live data - no scheduled downtime required
* Automatic self-throttling based on system load
* Automatic self-throttling - reduces system load
* btrfs support - recovers more free space from btrfs than naive dedupers
Weaknesses
----------
* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
* Requires root privilege (or `CAP_SYS_ADMIN`)
* First run may require temporary disk space for extent reorganization
* Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
* [First run may increase metadata space usage if many snapshots exist](gotchas.md)
* Constant hash table size - no decreased RAM usage if data set becomes smaller
* btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
-------------------
* [bees Gotchas](gotchas.md)
* [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING
* [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
* [bees vs. other btrfs features](btrfs-other.md)
* [What to do when something goes wrong](wrong.md)
@@ -69,6 +69,6 @@ You can also use Github:
Copyright & License
-------------------
Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).


@@ -15,16 +15,9 @@ specific files (patches welcome).
* PREALLOC extents and extents containing blocks filled with zeros will
be replaced by holes. There is no way to turn this off.
* Consecutive runs of duplicate blocks that are less than 12K in length
can take 30% of the processing time while saving only 3% of the disk
space. There should be an option to just not bother with those, but it's
complicated by the btrfs requirement to always dedupe complete extents.
* There is a lot of duplicate reading of blocks in snapshots. bees will
scan all snapshots at close to the same time to try to get better
performance by caching, but really fixing this requires rewriting the
crawler to scan the btrfs extent tree directly instead of the subvol
FS trees.
* The fundamental unit of deduplication is the extent _reference_, when
it should be the _extent_ itself. This is an architectural limitation
that results in excess reads of extent data, even in the Extent scan mode.
* Block reads are currently more allocation- and CPU-intensive than they
should be, especially for filesystems on SSD where the IO overhead is
@@ -33,8 +26,9 @@ much smaller. This is a problem for CPU-power-constrained environments
* bees can currently fragment extents when required to remove duplicate
blocks, but has no defragmentation capability yet. When possible, bees
will attempt to work with existing extent boundaries, but it will not
aggregate blocks together from multiple extents to create larger ones.
will attempt to work with existing extent boundaries and choose the
largest fragments available, but it will not aggregate blocks together
from multiple extents to create larger ones.
* When bees fragments an extent, the copied data is compressed. There
is currently no way (other than by modifying the source) to select a


@@ -36,6 +36,34 @@
Has no effect unless `--loadavg-target` is used to specify a target load.
* `--throttle-factor FACTOR`
In order to avoid saturating btrfs deferred work queues, bees tracks
the time that operations with delayed effect (dedupe and tmpfile copy)
and operations with long run times (`LOGICAL_INO`) run. If an operation
finishes before the average run time for that operation, bees will
sleep for the remainder of the average run time, so that operations
are submitted to btrfs at a rate similar to the rate that btrfs can
complete them.
The `FACTOR` is multiplied by the average run time for each operation
to calculate the target delay time.
`FACTOR` 0 is the default, which adds no delays. bees will attempt
to saturate btrfs delayed work queues as quickly as possible, which
may impact other processes on the same filesystem, or even slow down
bees itself.
`FACTOR` 1.0 will attempt to keep btrfs delayed work queues filled at
a steady average rate.
`FACTOR` more than 1.0 will add delays longer than the average
run time (e.g. 10.0 will delay all operations that take less than 10x
the average run time). High values of `FACTOR` may be desirable when
using bees with other applications on the same filesystem.
The maximum delay per operation is 60 seconds.
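Putting the description above together, the per-operation sleep works out to `max(0, FACTOR * avg_runtime - actual_runtime)`, capped at 60 seconds. A sketch of that calculation (the formula is inferred from the text above, not copied from the bees source):

```cpp
#include <algorithm>
#include <cassert>

// Compute how long to sleep after an operation so that submissions
// arrive at roughly the rate btrfs completes them. Times in seconds.
double throttle_delay(double factor, double avg_runtime, double actual_runtime) {
	const double target = factor * avg_runtime;    // target total time per op
	const double delay = target - actual_runtime;  // sleep for the remainder
	return std::min(60.0, std::max(0.0, delay));   // max delay is 60 seconds
}
```

With `FACTOR` 1.0 and a 2-second average, an operation that finishes in 0.5 seconds sleeps for the remaining 1.5 seconds; with `FACTOR` 0 the delay is always zero.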
## Filesystem tree traversal options
* `--scan-mode MODE` or `-m`
@@ -47,6 +75,7 @@
* Mode 1: independent
* Mode 2: sequential
* Mode 3: recent
* Mode 4: extent
For details of the different scanning modes and the default value of
this option, see [bees configuration](config.md).
@@ -55,19 +84,22 @@
* `--workaround-btrfs-send` or `-a`
_This option is obsolete and should not be used any more._
Pretend that read-only snapshots are empty and silently discard any
request to dedupe files referenced through them. This is a workaround for
[problems with the kernel implementation of `btrfs send` and `btrfs send
request to dedupe files referenced through them. This is a workaround
for [problems with old kernels running `btrfs send` and `btrfs send
-p`](btrfs-kernel.md) which make these btrfs features unusable with bees.
This option should be used to avoid breaking `btrfs send` on the same
filesystem.
This option was used to avoid breaking `btrfs send` on old kernels.
The affected kernels are now too old to be recommended for use with bees.
bees now waits for `btrfs send` to finish. There is no need for an
option to enable this.
**Note:** There is a _significant_ space tradeoff when using this option:
it is likely no space will be recovered--and possibly significant extra
space used--until the read-only snapshots are deleted. On the other
hand, if snapshots are rotated frequently then bees will spend less time
scanning them.
space used--until the read-only snapshots are deleted.
## Logging options


@@ -75,9 +75,8 @@ in the shell script that launches `bees`:
schedtool -D -n20 $$
ionice -c3 -p $$
You can also use the [`--loadavg-target` and `--thread-min`
options](options.md) to further control the impact of bees on the rest
of the system.
You can also use the [load management options](options.md) to further
control the impact of bees on the rest of the system.
Let the bees fly:


@@ -4,16 +4,13 @@ What to do when something goes wrong with bees
Hangs and excessive slowness
----------------------------
### Are you using qgroups or autodefrag?
Read about [bad btrfs feature interactions](btrfs-other.md).
### Use load-throttling options
If bees is just more aggressive than you would like, consider using
[load throttling options](options.md). These are usually more effective
than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
certainly use those too).
certainly use those too) because they limit work that bees queues up
for later execution inside btrfs.
### Check `$BEESSTATUS`
@@ -52,10 +49,6 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
Thread names of note:
* `crawl_12345`: scan/dedupe worker threads (the number is the subvol
ID which the thread is currently working on). These threads appear
and disappear from the status dynamically according to the requirements
of the work queue and loadavg throttling.
* `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
* `crawl_master`: task that finds new extents in the filesystem and populates the work queue
* `crawl_transid`: btrfs transid (generation number) tracker and polling thread
@@ -64,6 +57,13 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
* `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
* `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly
Most other threads have names that are derived from the current dedupe
task that they are executing:
* `ref_205ad76b1000_24K_50`: extent scan performing dedupe of btrfs extent bytenr `205ad76b1000`, which is 24 KiB long and has 50 references
* `extent_250_32M_16E`: extent scan searching for extents between 32 MiB + 1 and 16 EiB bytes long, tracking scan position in virtual subvol `250`.
* `crawl_378_18916`: subvol scan searching for extent refs in subvol `378`, inode `18916`.
### Dump kernel stacks of hung processes
Check the kernel stacks of all blocked kernel processes:
@@ -91,7 +91,7 @@ bees Crashes
(gdb) thread apply all bt full
The last line generates megabytes of output and will often crash gdb.
This is OK, submit whatever output gdb can produce.
Submit whatever output gdb can produce.
**Note that this output may include filenames or data from your
filesystem.**
@@ -160,8 +160,7 @@ Kernel crashes, corruption, and filesystem damage
-------------------------------------------------
bees doesn't do anything that _should_ cause corruption or data loss;
however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
with some Linux block device layers](btrfs-other.md), so corruption is
however, [btrfs has kernel bugs](btrfs-kernel.md), so corruption is
not impossible.
Issues with the btrfs filesystem kernel code or other block device layers


@@ -64,11 +64,13 @@ namespace crucible {
/// @{ Extent items (EXTENT_ITEM)
uint64_t extent_begin() const;
uint64_t extent_end() const;
uint64_t extent_flags() const;
uint64_t extent_generation() const;
/// @}
/// @{ Root items
uint64_t root_flags() const;
uint64_t root_refs() const;
/// @}
/// @{ Root backref items.
@@ -108,7 +110,9 @@ namespace crucible {
virtual ~BtrfsTreeFetcher() = default;
BtrfsTreeFetcher(Fd new_fd);
void type(uint8_t type);
uint8_t type();
void tree(uint64_t tree);
uint64_t tree();
void transid(uint64_t min_transid, uint64_t max_transid = numeric_limits<uint64_t>::max());
/// Block size (sectorsize) of filesystem
uint64_t block_size() const;
@@ -169,34 +173,42 @@ namespace crucible {
void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
};
/// Fetch extent items from extent tree
/// Fetch extent items from extent tree.
/// Does not filter out metadata! See BtrfsDataExtentTreeFetcher for that.
class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsExtentItemFetcher(const Fd &fd);
};
/// Fetch extent refs from an inode
/// Fetch extent refs from an inode. Caller must set the tree and objectid.
class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
public:
BtrfsExtentDataFetcher(const Fd &fd);
};
/// Fetch inodes from a subvol
class BtrfsFsTreeFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsFsTreeFetcher(const Fd &fd, uint64_t subvol);
};
/// Fetch raw inode items
class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsInodeFetcher(const Fd &fd);
BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
};
/// Fetch a root (subvol) item
class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsRootFetcher(const Fd &fd);
BtrfsTreeItem root(uint64_t subvol);
BtrfsTreeItem root_backref(uint64_t subvol);
};
/// Fetch data extent items from extent tree, skipping metadata-only block groups
class BtrfsDataExtentTreeFetcher : public BtrfsExtentItemFetcher {
BtrfsTreeItem m_current_bg;
BtrfsTreeOffsetFetcher m_chunk_tree;
protected:
virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr) override;
public:
BtrfsDataExtentTreeFetcher(const Fd &fd);
};
}


@@ -78,9 +78,6 @@ enum btrfs_compression_type {
#define BTRFS_SHARED_BLOCK_REF_KEY 182
#define BTRFS_SHARED_DATA_REF_KEY 184
#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
#define BTRFS_FREE_SPACE_INFO_KEY 198
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
#define BTRFS_DEV_EXTENT_KEY 204
#define BTRFS_DEV_ITEM_KEY 216
#define BTRFS_CHUNK_ITEM_KEY 228
@@ -97,6 +94,18 @@ enum btrfs_compression_type {
#endif
#ifndef BTRFS_FREE_SPACE_INFO_KEY
#define BTRFS_FREE_SPACE_INFO_KEY 198
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
#define BTRFS_FREE_SPACE_OBJECTID -11ULL
#endif
#ifndef BTRFS_BLOCK_GROUP_RAID1C4
#define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
#define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
#endif
#ifndef BTRFS_DEFRAG_RANGE_START_IO
// For some reason uapi has BTRFS_DEFRAG_RANGE_COMPRESS and


@@ -55,7 +55,6 @@ namespace crucible {
Pointer m_ptr;
size_t m_size = 0;
mutable mutex m_mutex;
friend ostream & operator<<(ostream &os, const ByteVector &bv);
};
template <class T>
@@ -74,6 +73,8 @@ namespace crucible {
THROW_CHECK2(out_of_range, size(), sizeof(T), size() >= sizeof(T));
return reinterpret_cast<T*>(data());
}
ostream& operator<<(ostream &os, const ByteVector &bv);
}
#endif // _CRUCIBLE_BYTEVECTOR_H_


@@ -197,11 +197,17 @@ namespace crucible {
size_t m_buf_size;
set<BtrfsIoctlSearchHeader> m_result;
static thread_local size_t s_calls;
static thread_local size_t s_loops;
static thread_local size_t s_loops_empty;
static thread_local shared_ptr<ostream> s_debug_ostream;
};
ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);
string btrfs_chunk_type_ntoa(uint64_t type);
string btrfs_search_type_ntoa(unsigned type);
string btrfs_search_objectid_ntoa(uint64_t objectid);
string btrfs_compress_type_ntoa(uint8_t type);
@@ -239,14 +245,14 @@ namespace crucible {
unsigned long available() const;
};
template<class V> ostream &hexdump(ostream &os, const V &v);
struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args_v3 {
BtrfsIoctlFsInfoArgs();
void do_ioctl(int fd);
bool do_ioctl_nothrow(int fd);
uint16_t csum_type() const;
uint16_t csum_size() const;
uint64_t generation() const;
vector<uint8_t> fsid() const;
};
ostream & operator<<(ostream &os, const BtrfsIoctlFsInfoArgs &a);


@@ -12,12 +12,14 @@ namespace crucible {
ostream &
hexdump(ostream &os, const V &v)
{
os << "V { size = " << v.size() << ", data:\n";
for (size_t i = 0; i < v.size(); i += 8) {
const auto v_size = v.size();
const uint8_t* const v_data = reinterpret_cast<const uint8_t*>(v.data());
os << "V { size = " << v_size << ", data:\n";
for (size_t i = 0; i < v_size; i += 8) {
string hex, ascii;
for (size_t j = i; j < i + 8; ++j) {
if (j < v.size()) {
uint8_t c = v[j];
if (j < v_size) {
const uint8_t c = v_data[j];
char buf[8];
sprintf(buf, "%02x ", c);
hex += buf;


@@ -117,7 +117,7 @@ namespace crucible {
while (full() || locked(name)) {
m_condvar.wait(lock);
}
auto rv = m_set.insert(make_pair(name, crucible::gettid()));
auto rv = m_set.insert(make_pair(name, gettid()));
THROW_CHECK0(runtime_error, rv.second);
}
@@ -129,7 +129,7 @@ namespace crucible {
if (full() || locked(name)) {
return false;
}
auto rv = m_set.insert(make_pair(name, crucible::gettid()));
auto rv = m_set.insert(make_pair(name, gettid()));
THROW_CHECK1(runtime_error, name, rv.second);
return true;
}


@@ -14,6 +14,7 @@ namespace crucible {
mutex m_mutex;
condition_variable m_cv;
map<string, size_t> m_counters;
bool m_do_locking = true;
class LockHandle {
const string m_type;
@@ -33,6 +34,7 @@ namespace crucible {
shared_ptr<LockHandle> get_lock_private(const string &type);
public:
static shared_ptr<LockHandle> get_lock(const string &type);
static void enable_locking(bool enabled);
};
}


@@ -0,0 +1,52 @@
#ifndef CRUCIBLE_OPENAT2_H
#define CRUCIBLE_OPENAT2_H
#include <cstdlib>
// Compatibility for building on old libc for new kernel
#include <linux/version.h>
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
#include <linux/openat2.h>
#else
#include <linux/types.h>
#ifndef RESOLVE_NO_XDEV
#define RESOLVE_NO_XDEV 1
// RESOLVE_NO_XDEV was there from the beginning of openat2,
// so if that's missing, so is open_how
struct open_how {
__u64 flags;
__u64 mode;
__u64 resolve;
};
#endif
#ifndef RESOLVE_NO_MAGICLINKS
#define RESOLVE_NO_MAGICLINKS 2
#endif
#ifndef RESOLVE_NO_SYMLINKS
#define RESOLVE_NO_SYMLINKS 4
#endif
#ifndef RESOLVE_BENEATH
#define RESOLVE_BENEATH 8
#endif
#ifndef RESOLVE_IN_ROOT
#define RESOLVE_IN_ROOT 16
#endif
#endif // Linux version >= v5.6
extern "C" {
/// Weak symbol to support libc with no syscall wrapper
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size) throw();
};
#endif // CRUCIBLE_OPENAT2_H
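With the compatibility declarations above in place, a call site can attempt `openat2` and tolerate older kernels. A usage sketch (the helper name `openat2_beneath` is hypothetical; the raw-syscall path avoids depending on a libc wrapper, and pre-5.6 kernels return `ENOSYS`):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open `path` read-only with resolution confined beneath `dirfd`,
// as a deduper might do to safely open files inside a mount point.
// Returns -1 with errno == ENOSYS when the kernel lacks openat2.
int openat2_beneath(int dirfd, const char *path) {
	// Layout-compatible with struct open_how (24 bytes).
	struct { uint64_t flags, mode, resolve; } how = {};
	how.flags = O_RDONLY;
	how.resolve = 8;  // RESOLVE_BENEATH
#ifdef SYS_openat2
	return static_cast<int>(syscall(SYS_openat2, dirfd, path, &how, sizeof(how)));
#else
	errno = ENOSYS;
	return -1;
#endif
}
```

On success the returned fd behaves like one from plain `openat`; the difference is that absolute paths and `..` escapes are rejected during resolution.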


@@ -10,6 +10,10 @@
#include <sys/wait.h>
#include <unistd.h>
extern "C" {
pid_t gettid() throw();
};
namespace crucible {
using namespace std;
@@ -73,7 +77,6 @@ namespace crucible {
typedef ResourceHandle<Process::id, Process> Pid;
pid_t gettid();
double getloadavg1();
double getloadavg5();
double getloadavg15();


@@ -6,23 +6,23 @@
#include <algorithm>
#include <limits>
#include <cstdint>
#if 1
// Debug stream
#include <memory>
#include <iostream>
#include <sstream>
#define DINIT(__x) __x
#define DLOG(__x) do { logs << __x << std::endl; } while (false)
#define DOUT(__err) do { __err << logs.str(); } while (false)
#else
#define DINIT(__x) do {} while (false)
#define DLOG(__x) do {} while (false)
#define DOUT(__x) do {} while (false)
#endif
#include <cstdint>
namespace crucible {
using namespace std;
extern thread_local shared_ptr<ostream> tl_seeker_debug_str;
#define SEEKER_DEBUG_LOG(__x) do { \
if (tl_seeker_debug_str) { \
(*tl_seeker_debug_str) << __x << "\n"; \
} \
} while (false)
// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
// - fetches objects in Pos order, starting from lower (must be >= lower)
// - must return upper if present, may or may not return objects after that
@@ -49,113 +49,108 @@ namespace crucible {
Pos
seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
{
DINIT(ostringstream logs);
try {
static const Pos end_pos = numeric_limits<Pos>::max();
// TBH this probably won't work if begin_pos != 0, i.e. any signed type
static const Pos begin_pos = numeric_limits<Pos>::min();
// Run a binary search looking for the highest key below target_pos.
// Initial upper bound of the search is target_pos.
// Find initial lower bound by doubling the size of the range until a key below target_pos
// is found, or the lower bound reaches the beginning of the search space.
// If the lower bound search reaches the beginning of the search space without finding a key,
// return the beginning of the search space; otherwise, perform a binary search between
// the bounds now established.
Pos lower_bound = 0;
Pos upper_bound = target_pos;
bool found_low = false;
Pos probe_pos = target_pos;
// We need one loop for each bit of the search space to find the lower bound,
// one loop for each bit of the search space to find the upper bound,
// and one extra loop to confirm the boundary is correct.
for (size_t loop_count = min(numeric_limits<Pos>::digits * size_t(2) + 1, max_loops); loop_count; --loop_count) {
DLOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
auto result = fetch(probe_pos, target_pos);
const Pos low_pos = result.empty() ? end_pos : *result.begin();
const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
DLOG(" = " << low_pos << ".." << high_pos);
// check for correct behavior of the fetch function
THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
if (!found_low) {
// if target_pos == end_pos then we will find it in every empty result set,
// so in that case we force the lower bound to be lower than end_pos
if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
// found a lower bound, set the low bound there and switch to binary search
found_low = true;
lower_bound = low_pos;
DLOG("found_low = true, lower_bound = " << lower_bound);
} else {
// still looking for lower bound
// if probe_pos was begin_pos then we can stop with no result
if (probe_pos == begin_pos) {
DLOG("return: probe_pos == begin_pos " << begin_pos);
return begin_pos;
}
// double the range size, or use the distance between objects found so far
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
// already checked low_pos <= high_pos above
const Pos want_delta = max(upper_bound - probe_pos, min_step);
// avoid underflowing the beginning of the search space
const Pos have_delta = min(want_delta, probe_pos - begin_pos);
THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
// move probe and try again
probe_pos = probe_pos - have_delta;
DLOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
continue;
static const Pos end_pos = numeric_limits<Pos>::max();
// TBH this probably won't work if begin_pos != 0, i.e. any signed type
static const Pos begin_pos = numeric_limits<Pos>::min();
// Run a binary search looking for the highest key below target_pos.
// Initial upper bound of the search is target_pos.
// Find initial lower bound by doubling the size of the range until a key below target_pos
// is found, or the lower bound reaches the beginning of the search space.
// If the lower bound search reaches the beginning of the search space without finding a key,
// return the beginning of the search space; otherwise, perform a binary search between
// the bounds now established.
Pos lower_bound = 0;
Pos upper_bound = target_pos;
bool found_low = false;
Pos probe_pos = target_pos;
// We need one loop for each bit of the search space to find the lower bound,
// one loop for each bit of the search space to find the upper bound,
// and one extra loop to confirm the boundary is correct.
for (size_t loop_count = min((1 + numeric_limits<Pos>::digits) * size_t(2), max_loops); loop_count; --loop_count) {
SEEKER_DEBUG_LOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
auto result = fetch(probe_pos, target_pos);
const Pos low_pos = result.empty() ? end_pos : *result.begin();
const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
SEEKER_DEBUG_LOG(" = " << low_pos << ".." << high_pos);
// check for correct behavior of the fetch function
THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
if (!found_low) {
// if target_pos == end_pos then we will find it in every empty result set,
// so in that case we force the lower bound to be lower than end_pos
if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
// found a lower bound, set the low bound there and switch to binary search
found_low = true;
lower_bound = low_pos;
SEEKER_DEBUG_LOG("found_low = true, lower_bound = " << lower_bound);
} else {
// still looking for lower bound
// if probe_pos was begin_pos then we can stop with no result
if (probe_pos == begin_pos) {
SEEKER_DEBUG_LOG("return: probe_pos == begin_pos " << begin_pos);
return begin_pos;
}
// double the range size, or use the distance between objects found so far
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
// already checked low_pos <= high_pos above
const Pos want_delta = max(upper_bound - probe_pos, min_step);
// avoid underflowing the beginning of the search space
const Pos have_delta = min(want_delta, probe_pos - begin_pos);
THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
// move probe and try again
probe_pos = probe_pos - have_delta;
SEEKER_DEBUG_LOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
continue;
}
}
if (low_pos <= target_pos && target_pos <= high_pos) {
// have keys on either side of target_pos in result
// search from the high end until we find the highest key below target
for (auto i = result.rbegin(); i != result.rend(); ++i) {
// more correctness checking for fetch
THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
if (*i <= target_pos) {
SEEKER_DEBUG_LOG("return: *i " << *i << " <= target_pos " << target_pos);
return *i;
}
}
// if the list is empty then low_pos = high_pos = end_pos
// if target_pos = end_pos also, then we will execute the loop
// above but not find any matching entries.
THROW_CHECK0(runtime_error, result.empty());
}
if (target_pos <= low_pos) {
// results are all too high, so probe_pos..low_pos is too high
// lower the high bound to the probe pos, low_pos cannot be lower
SEEKER_DEBUG_LOG("upper_bound = probe_pos " << probe_pos);
upper_bound = probe_pos;
}
if (high_pos < target_pos) {
// results are all too low, so probe_pos..high_pos is too low
// raise the low bound to high_pos but not above upper_bound
const auto next_pos = min(high_pos, upper_bound);
SEEKER_DEBUG_LOG("lower_bound = next_pos " << next_pos);
lower_bound = next_pos;
}
// compute a new probe pos at the middle of the range and try again
// we can't have a zero-size range here because we would not have set found_low yet
THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
const Pos delta = (upper_bound - lower_bound) / 2;
probe_pos = lower_bound + delta;
if (delta < 1) {
// nothing can exist in the range (lower_bound, upper_bound)
// and an object is known to exist at lower_bound
SEEKER_DEBUG_LOG("return: probe_pos == lower_bound " << lower_bound);
return lower_bound;
}
THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
SEEKER_DEBUG_LOG("loop bottom: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
}
THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
"found_low " << found_low);
} catch (...) {
DOUT(cerr);
throw;
}
}

include/crucible/table.h Normal file

@@ -0,0 +1,106 @@
#ifndef CRUCIBLE_TABLE_H
#define CRUCIBLE_TABLE_H
#include <functional>
#include <limits>
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>
namespace crucible {
namespace Table {
using namespace std;
using Content = function<string(size_t width, size_t height)>;
const size_t endpos = numeric_limits<size_t>::max();
Content Fill(const char c);
Content Text(const string& s);
template <class T>
Content Number(const T& num)
{
ostringstream oss;
oss << num;
return Text(oss.str());
}
class Cell {
Content m_content;
public:
Cell(const Content &fn = [](size_t, size_t) { return string(); } );
Cell& operator=(const Content &fn);
string text(size_t width, size_t height) const;
};
class Dimension {
size_t m_next_pos = 0;
vector<size_t> m_elements;
friend class Table;
size_t at(size_t) const;
public:
size_t size() const;
size_t insert(size_t pos);
void erase(size_t pos);
};
class Table {
Dimension m_rows, m_cols;
map<pair<size_t, size_t>, Cell> m_cells;
string m_left = "|";
string m_mid = "|";
string m_right = "|";
public:
Dimension &rows();
const Dimension& rows() const;
Dimension &cols();
const Dimension& cols() const;
Cell& at(size_t row, size_t col);
const Cell& at(size_t row, size_t col) const;
template <class T> void insert_row(size_t pos, const T& container);
template <class T> void insert_col(size_t pos, const T& container);
void left(const string &s);
void mid(const string &s);
void right(const string &s);
const string& left() const;
const string& mid() const;
const string& right() const;
};
ostream& operator<<(ostream &os, const Table &table);
template <class T>
void
Table::insert_row(size_t pos, const T& container)
{
const auto new_pos = m_rows.insert(pos);
size_t col = 0;
for (const auto &i : container) {
if (col >= cols().size()) {
cols().insert(col);
}
at(new_pos, col++) = i;
}
}
template <class T>
void
Table::insert_col(size_t pos, const T& container)
{
const auto new_pos = m_cols.insert(pos);
size_t row = 0;
for (const auto &i : container) {
if (row >= rows().size()) {
rows().insert(row);
}
at(row++, new_pos) = i;
}
}
}
}
#endif // CRUCIBLE_TABLE_H


@@ -40,10 +40,17 @@ namespace crucible {
/// after the current instance exits.
void run() const;
/// Schedule task to run when no other Task is available.
void idle() const;
/// Schedule Task to run after this Task has run or
/// been destroyed.
void append(const Task &task) const;
/// Schedule Task to run after this Task has run or
/// been destroyed, in Task ID order.
void insert(const Task &task) const;
/// Describe Task as text.
string title() const;
@@ -163,15 +170,12 @@ namespace crucible {
/// (it is the ExclusionLock that owns the lock, so it can
/// be passed to other Tasks or threads, but this is not
/// recommended practice).
/// If not successful, current Task is appended to the
/// If not successful, the argument Task is appended to the
/// task that currently holds the lock. Current task is
/// expected to release any other ExclusionLock
/// expected to immediately release any other ExclusionLock
/// objects it holds, and exit its Task function.
ExclusionLock try_lock(const Task &task);
/// Execute Task when Exclusion is unlocked (possibly
/// immediately).
void insert_task(const Task &t);
};
/// Wrapper around pthread_setname_np which handles length limits


@@ -34,7 +34,7 @@ namespace crucible {
double m_rate;
double m_burst;
double m_tokens = 0.0;
mutex m_mutex;
mutable mutex m_mutex;
void update_tokens();
RateLimiter() = delete;
@@ -45,6 +45,8 @@ namespace crucible {
double sleep_time(double cost = 1.0);
bool is_ready();
void borrow(double cost = 1.0);
void rate(double new_rate);
double rate() const;
};
class RateEstimator {
@@ -88,6 +90,9 @@ namespace crucible {
// Read count
uint64_t count() const;
/// Increment count (like update(count() + more), but atomic)
void increment(uint64_t more = 1);
// Convert counts to chrono types
chrono::high_resolution_clock::time_point time_point(uint64_t absolute_count) const;
chrono::duration<double> duration(uint64_t relative_count) const;


@@ -14,9 +14,12 @@ CRUCIBLE_OBJS = \
fs.o \
multilock.o \
ntoa.o \
openat2.o \
path.o \
process.o \
seeker.o \
string.o \
table.o \
task.o \
time.o \
uname.o \


@@ -5,6 +5,12 @@
#include "crucible/hexdump.h"
#include "crucible/seeker.h"
#define CRUCIBLE_BTRFS_TREE_DEBUG(x) do { \
if (BtrfsIoctlSearchKey::s_debug_ostream) { \
(*BtrfsIoctlSearchKey::s_debug_ostream) << x; \
} \
} while (false)
namespace crucible {
using namespace std;
@@ -22,6 +28,13 @@ namespace crucible {
return m_objectid + m_offset;
}
uint64_t
BtrfsTreeItem::extent_flags() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
return btrfs_get_member(&btrfs_extent_item::flags, m_data);
}
uint64_t
BtrfsTreeItem::extent_generation() const
{
@@ -61,6 +74,13 @@ namespace crucible {
return btrfs_get_member(&btrfs_root_item::flags, m_data);
}
uint64_t
BtrfsTreeItem::root_refs() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_ITEM_KEY);
return btrfs_get_member(&btrfs_root_item::refs, m_data);
}
ostream &
operator<<(ostream &os, const BtrfsTreeItem &bti)
{
@@ -269,12 +289,24 @@ namespace crucible {
m_type = type;
}
uint8_t
BtrfsTreeFetcher::type()
{
return m_type;
}
void
BtrfsTreeFetcher::tree(uint64_t tree)
{
m_tree = tree;
}
uint64_t
BtrfsTreeFetcher::tree()
{
return m_tree;
}
void
BtrfsTreeFetcher::transid(uint64_t min_transid, uint64_t max_transid)
{
@@ -329,6 +361,7 @@ namespace crucible {
BtrfsTreeItem
BtrfsTreeFetcher::at(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("at " << logical);
BtrfsIoctlSearchKey &sk = m_sk;
fill_sk(sk, logical);
// Exact match, should return 0 or 1 items
@@ -371,53 +404,59 @@ namespace crucible {
BtrfsTreeFetcher::rlower_bound(uint64_t logical)
{
#if 0
#define BTFRLB_DEBUG(x) do { cerr << x; } while (false)
static bool btfrlb_debug = getenv("BTFLRB_DEBUG");
#define BTFRLB_DEBUG(x) do { if (btfrlb_debug) cerr << x; } while (false)
#else
#define BTFRLB_DEBUG(x) do { } while (false)
#define BTFRLB_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
#endif
BtrfsTreeItem closest_item;
uint64_t closest_logical = 0;
BtrfsIoctlSearchKey &sk = m_sk;
size_t loops = 0;
BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << endl);
seek_backward(scale_logical(logical), [&](uint64_t lower_bound, uint64_t upper_bound) {
BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << " in tree " << tree() << endl);
seek_backward(scale_logical(logical), [&](uint64_t const lower_bound, uint64_t const upper_bound) {
++loops;
fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
set<uint64_t> rv;
bool too_far = false;
do {
sk.nr_items = 4;
sk.do_ioctl(fd());
BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
for (auto &i : sk.m_result) {
next_sk(sk, i);
const auto this_logical = hdr_logical(i);
const auto scaled_hdr_logical = scale_logical(this_logical);
BTFRLB_DEBUG(" " << to_hex(scaled_hdr_logical));
if (hdr_match(i)) {
if (this_logical <= logical && this_logical > closest_logical) {
closest_logical = this_logical;
closest_item = i;
}
BTFRLB_DEBUG("(match)");
rv.insert(scaled_hdr_logical);
}
if (scaled_hdr_logical > upper_bound || hdr_stop(i)) {
if (scaled_hdr_logical >= upper_bound) {
BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
}
if (hdr_stop(i)) {
rv.insert(numeric_limits<uint64_t>::max());
BTFRLB_DEBUG("(stop)");
}
// If hdr_stop or !hdr_match, don't inspect the item
if (hdr_stop(i)) {
too_far = true;
rv.insert(numeric_limits<uint64_t>::max());
BTFRLB_DEBUG("(stop)");
break;
} else {
BTFRLB_DEBUG("(cont'd)");
}
if (!hdr_match(i)) {
BTFRLB_DEBUG("(no match)");
continue;
}
const auto this_logical = hdr_logical(i);
BTFRLB_DEBUG(" " << to_hex(this_logical) << " " << i);
const auto scaled_hdr_logical = scale_logical(this_logical);
BTFRLB_DEBUG(" " << "(match)");
if (scaled_hdr_logical > upper_bound) {
too_far = true;
BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
break;
}
if (this_logical <= logical && this_logical > closest_logical) {
closest_logical = this_logical;
closest_item = i;
BTFRLB_DEBUG("(closest)");
}
rv.insert(scaled_hdr_logical);
BTFRLB_DEBUG("(cont'd)");
}
BTFRLB_DEBUG(endl);
// We might get a search result that contains only non-matching items.
// Keep looping until we find any matching item or we run out of tree.
} while (rv.empty() && !sk.m_result.empty());
} while (!too_far && rv.empty() && !sk.m_result.empty());
return rv;
}, scale_logical(lookbehind_size()));
return closest_item;
@@ -448,6 +487,7 @@ namespace crucible {
BtrfsTreeItem
BtrfsTreeFetcher::next(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("next " << logical);
const auto scaled_logical = scale_logical(logical);
if (scaled_logical + 1 > scaled_max_logical()) {
return BtrfsTreeItem();
@@ -458,6 +498,7 @@ namespace crucible {
BtrfsTreeItem
BtrfsTreeFetcher::prev(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("prev " << logical);
const auto scaled_logical = scale_logical(logical);
if (scaled_logical < 1) {
return BtrfsTreeItem();
@@ -542,9 +583,10 @@ namespace crucible {
BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
{
#if 0
#define BCTFGS_DEBUG(x) do { cerr << x; } while (false)
static bool bctfgs_debug = getenv("BCTFGS_DEBUG");
#define BCTFGS_DEBUG(x) do { if (bctfgs_debug) cerr << x; } while (false)
#else
#define BCTFGS_DEBUG(x) do { } while (false)
#define BCTFGS_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
#endif
const uint64_t logical_end = logical + count * block_size();
BtrfsTreeItem bti = rlower_bound(logical);
@@ -636,14 +678,6 @@ namespace crucible {
type(BTRFS_EXTENT_DATA_KEY);
}
BtrfsFsTreeFetcher::BtrfsFsTreeFetcher(const Fd &new_fd, uint64_t subvol) :
BtrfsTreeObjectFetcher(new_fd)
{
tree(subvol);
type(BTRFS_EXTENT_DATA_KEY);
scale_size(1);
}
BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
BtrfsTreeObjectFetcher(fd)
{
@@ -667,18 +701,86 @@ namespace crucible {
BtrfsTreeObjectFetcher(fd)
{
tree(BTRFS_ROOT_TREE_OBJECTID);
type(BTRFS_ROOT_ITEM_KEY);
scale_size(1);
}
BtrfsTreeItem
BtrfsRootFetcher::root(uint64_t subvol)
BtrfsRootFetcher::root(const uint64_t subvol)
{
const auto my_type = BTRFS_ROOT_ITEM_KEY;
type(my_type);
const auto item = at(subvol);
if (!!item) {
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
THROW_CHECK2(runtime_error, item.type(), BTRFS_ROOT_ITEM_KEY, item.type() == BTRFS_ROOT_ITEM_KEY);
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
}
return item;
}
BtrfsTreeItem
BtrfsRootFetcher::root_backref(const uint64_t subvol)
{
const auto my_type = BTRFS_ROOT_BACKREF_KEY;
type(my_type);
const auto item = at(subvol);
if (!!item) {
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
}
return item;
}
BtrfsDataExtentTreeFetcher::BtrfsDataExtentTreeFetcher(const Fd &fd) :
BtrfsExtentItemFetcher(fd),
m_chunk_tree(fd)
{
tree(BTRFS_EXTENT_TREE_OBJECTID);
type(BTRFS_EXTENT_ITEM_KEY);
m_chunk_tree.tree(BTRFS_CHUNK_TREE_OBJECTID);
m_chunk_tree.type(BTRFS_CHUNK_ITEM_KEY);
m_chunk_tree.objectid(BTRFS_FIRST_CHUNK_TREE_OBJECTID);
}
void
BtrfsDataExtentTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
{
key.min_type = key.max_type = type();
key.max_objectid = key.max_offset = numeric_limits<uint64_t>::max();
key.min_offset = 0;
key.min_objectid = hdr.objectid;
const auto step = scale_size();
if (key.min_objectid < numeric_limits<uint64_t>::max() - step) {
key.min_objectid += step;
} else {
key.min_objectid = numeric_limits<uint64_t>::max();
}
// If we're still in our current block group, check here
if (!!m_current_bg) {
const auto bg_begin = m_current_bg.offset();
const auto bg_end = bg_begin + m_current_bg.chunk_length();
// If we are still in our current block group, return early
if (key.min_objectid >= bg_begin && key.min_objectid < bg_end) return;
}
// We don't have a current block group or we're out of range
// Find the chunk that this bytenr belongs to
m_current_bg = m_chunk_tree.rlower_bound(key.min_objectid);
// Make sure it's a data block group
while (!!m_current_bg) {
// Data block group, stop here
if (m_current_bg.chunk_type() & BTRFS_BLOCK_GROUP_DATA) break;
// Not a data block group, skip to end
key.min_objectid = m_current_bg.offset() + m_current_bg.chunk_length();
m_current_bg = m_chunk_tree.lower_bound(key.min_objectid);
}
if (!m_current_bg) {
// Ran out of data block groups, stop here
return;
}
// Check to see if bytenr is in the current data block group
const auto bg_begin = m_current_bg.offset();
if (key.min_objectid < bg_begin) {
// Move forward to start of data block group
key.min_objectid = bg_begin;
}
}
}


@@ -44,10 +44,10 @@ namespace crucible {
}
ByteVector::value_type&
ByteVector::operator[](size_t size) const
ByteVector::operator[](size_t index) const
{
unique_lock<mutex> lock(m_mutex);
return m_ptr.get()[size];
return m_ptr.get()[index];
}
ByteVector::ByteVector(const ByteVector &that)
@@ -183,7 +183,6 @@ namespace crucible {
ostream&
operator<<(ostream &os, const ByteVector &bv) {
unique_lock<mutex> lock(bv.m_mutex);
hexdump(os, bv);
return os;
}


@@ -76,7 +76,7 @@ namespace crucible {
DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));
header_stream << buf;
header_stream << " " << getpid() << "." << crucible::gettid();
header_stream << " " << getpid() << "." << gettid();
if (add_prefix_level) {
header_stream << "<" << m_loglevel << ">";
}
@@ -88,7 +88,7 @@ namespace crucible {
header_stream << "<" << m_loglevel << ">";
}
header_stream << (m_name.empty() ? "thread" : m_name);
header_stream << "[" << crucible::gettid() << "]";
header_stream << "[" << gettid() << "]";
}
header_stream << ": ";


@@ -159,12 +159,13 @@ namespace crucible {
{
THROW_CHECK1(invalid_argument, src_length, src_length > 0);
while (src_length > 0) {
off_t length = min(off_t(BTRFS_MAX_DEDUPE_LEN), src_length);
BtrfsExtentSame bes(src_fd, src_offset, length);
BtrfsExtentSame bes(src_fd, src_offset, src_length);
bes.add(dst_fd, dst_offset);
bes.do_ioctl();
auto status = bes.m_info.at(0).status;
const auto status = bes.m_info.at(0).status;
if (status == 0) {
const off_t length = bes.m_info.at(0).bytes_deduped;
THROW_CHECK0(invalid_argument, length > 0);
src_offset += length;
dst_offset += length;
src_length -= length;
@@ -333,7 +334,7 @@ namespace crucible {
btrfs_ioctl_logical_ino_args args = (btrfs_ioctl_logical_ino_args) {
.logical = m_logical,
.size = m_container_size,
.inodes = reinterpret_cast<uint64_t>(m_container.prepare(m_container_size)),
.inodes = reinterpret_cast<uintptr_t>(m_container.prepare(m_container_size)),
};
// We are still supporting building with old headers that don't have .flags yet
*(&args.reserved[0] + 3) = m_flags;
@@ -416,7 +417,7 @@ namespace crucible {
{
btrfs_ioctl_ino_path_args *p = static_cast<btrfs_ioctl_ino_path_args *>(this);
BtrfsDataContainer container(m_container_size);
fspath = reinterpret_cast<uint64_t>(container.prepare(m_container_size));
fspath = reinterpret_cast<uintptr_t>(container.prepare(m_container_size));
size = container.get_size();
m_paths.clear();
@@ -753,6 +754,11 @@ namespace crucible {
return offset + len;
}
thread_local size_t BtrfsIoctlSearchKey::s_calls = 0;
thread_local size_t BtrfsIoctlSearchKey::s_loops = 0;
thread_local size_t BtrfsIoctlSearchKey::s_loops_empty = 0;
thread_local shared_ptr<ostream> BtrfsIoctlSearchKey::s_debug_ostream;
bool
BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
{
@@ -771,8 +777,17 @@ namespace crucible {
ioctl_ptr = ioctl_arg.get<btrfs_ioctl_search_args_v2>();
ioctl_ptr->key = static_cast<const btrfs_ioctl_search_key&>(*this);
ioctl_ptr->buf_size = buf_size;
if (s_debug_ostream) {
(*s_debug_ostream) << "bisk " << (ioctl_ptr->key) << "\n";
}
// Don't bother supporting V1. Kernels that old have other problems.
int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_arg.data());
++s_calls;
if (rv != 0 && errno == ENOENT) {
// If we are searching a tree that is deleted or no longer exists, just return an empty list
ioctl_ptr->key.nr_items = 0;
break;
}
if (rv != 0 && errno != EOVERFLOW) {
return false;
}
@@ -794,6 +809,10 @@ namespace crucible {
buf_size *= 2;
}
// don't automatically raise the buf size higher than 64K, the largest possible btrfs item
++s_loops;
if (ioctl_ptr->key.nr_items == 0) {
++s_loops_empty;
}
} while (buf_size < 65536);
// ioctl changes nr_items, this has to be copied back
@@ -866,6 +885,26 @@ namespace crucible {
}
}
string
btrfs_chunk_type_ntoa(uint64_t type)
{
static const bits_ntoa_table table[] = {
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DATA),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_METADATA),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_SYSTEM),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DUP),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID0),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID10),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C3),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C4),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID5),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID6),
NTOA_TABLE_ENTRY_END()
};
return bits_ntoa(type, table);
}
string
btrfs_search_type_ntoa(unsigned type)
{
@@ -893,15 +932,9 @@ namespace crucible {
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_BLOCK_REF_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_DATA_REF_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_BLOCK_GROUP_ITEM_KEY),
#ifdef BTRFS_FREE_SPACE_INFO_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_INFO_KEY),
#endif
#ifdef BTRFS_FREE_SPACE_EXTENT_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_EXTENT_KEY),
#endif
#ifdef BTRFS_FREE_SPACE_BITMAP_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_BITMAP_KEY),
#endif
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_EXTENT_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_ITEM_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_CHUNK_ITEM_KEY),
@@ -933,9 +966,7 @@ namespace crucible {
NTOA_TABLE_ENTRY_ENUM(BTRFS_CSUM_TREE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_QUOTA_TREE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_UUID_TREE_OBJECTID),
#ifdef BTRFS_FREE_SPACE_TREE_OBJECTID
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_TREE_OBJECTID),
#endif
NTOA_TABLE_ENTRY_ENUM(BTRFS_BALANCE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_ORPHAN_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_TREE_LOG_OBJECTID),
@@ -1123,11 +1154,17 @@ namespace crucible {
{
}
void
BtrfsIoctlFsInfoArgs::do_ioctl(int fd)
bool
BtrfsIoctlFsInfoArgs::do_ioctl_nothrow(int const fd)
{
btrfs_ioctl_fs_info_args_v3 *p = static_cast<btrfs_ioctl_fs_info_args_v3 *>(this);
if (ioctl(fd, BTRFS_IOC_FS_INFO, p)) {
return 0 == ioctl(fd, BTRFS_IOC_FS_INFO, p);
}
void
BtrfsIoctlFsInfoArgs::do_ioctl(int const fd)
{
if (!do_ioctl_nothrow(fd)) {
THROW_ERRNO("BTRFS_IOC_FS_INFO: fd " << fd);
}
}
@@ -1144,6 +1181,13 @@ namespace crucible {
return this->btrfs_ioctl_fs_info_args_v3::csum_size;
}
vector<uint8_t>
BtrfsIoctlFsInfoArgs::fsid() const
{
const auto begin = btrfs_ioctl_fs_info_args_v3::fsid;
return vector<uint8_t>(begin, begin + BTRFS_FSID_SIZE);
}
uint64_t
BtrfsIoctlFsInfoArgs::generation() const
{


@@ -62,11 +62,22 @@ namespace crucible {
return rv;
}
static MultiLocker s_process_instance;
shared_ptr<MultiLocker::LockHandle>
MultiLocker::get_lock(const string &type)
{
static MultiLocker s_process_instance;
return s_process_instance.get_lock_private(type);
if (s_process_instance.m_do_locking) {
return s_process_instance.get_lock_private(type);
} else {
return shared_ptr<MultiLocker::LockHandle>();
}
}
void
MultiLocker::enable_locking(const bool enabled)
{
s_process_instance.m_do_locking = enabled;
}
}

lib/openat2.cc Normal file

@@ -0,0 +1,40 @@
#include "crucible/openat2.h"
#include <sys/syscall.h>
// Compatibility for building on old libc for new kernel
#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 6, 0)
// Every arch that defines this uses 437, except Alpha, where 437 is
// mq_getsetattr.
#ifndef SYS_openat2
#ifdef __alpha__
#define SYS_openat2 547
#else
#define SYS_openat2 437
#endif
#endif
#endif // Linux version >= v5.6
#include <fcntl.h>
#include <unistd.h>
extern "C" {
int
__attribute__((weak))
openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
throw()
{
#ifdef SYS_openat2
return syscall(SYS_openat2, dirfd, pathname, how, size);
#else
errno = ENOSYS;
return -1;
#endif
}
};


@@ -7,13 +7,18 @@
#include <cstdlib>
#include <utility>
// for gettid()
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>
#include <sys/syscall.h>
extern "C" {
pid_t
__attribute__((weak))
gettid() throw()
{
return syscall(SYS_gettid);
}
};
namespace crucible {
using namespace std;
@@ -111,12 +116,6 @@ namespace crucible {
}
}
pid_t
gettid()
{
return syscall(SYS_gettid);
}
double
getloadavg1()
{

lib/seeker.cc Normal file

@@ -0,0 +1,7 @@
#include "crucible/seeker.h"
namespace crucible {
thread_local shared_ptr<ostream> tl_seeker_debug_str;
};

lib/table.cc Normal file

@@ -0,0 +1,254 @@
#include "crucible/table.h"
#include "crucible/string.h"
namespace crucible {
namespace Table {
using namespace std;
Content
Fill(const char c)
{
return [=](size_t width, size_t height) -> string {
string rv;
while (height--) {
rv += string(width, c);
if (height) {
rv += "\n";
}
}
return rv;
};
}
Content
Text(const string &s)
{
return [=](size_t width, size_t height) -> string {
const auto lines = split("\n", s);
string rv;
size_t line_count = 0;
for (const auto &i : lines) {
if (line_count++) {
rv += "\n";
}
if (i.length() < width) {
rv += string(width - i.length(), ' ');
}
rv += i;
}
while (line_count < height) {
if (line_count++) {
rv += "\n";
}
rv += string(width, ' ');
}
return rv;
};
}
Content
Number(const string &s)
{
return [=](size_t width, size_t height) -> string {
const auto lines = split("\n", s);
string rv;
size_t line_count = 0;
for (const auto &i : lines) {
if (line_count++) {
rv += "\n";
}
if (i.length() < width) {
rv += string(width - i.length(), ' ');
}
rv += i;
}
while (line_count < height) {
if (line_count++) {
rv += "\n";
}
rv += string(width, ' ');
}
return rv;
};
}
Cell::Cell(const Content &fn) :
m_content(fn)
{
}
Cell&
Cell::operator=(const Content &fn)
{
m_content = fn;
return *this;
}
string
Cell::text(size_t width, size_t height) const
{
return m_content(width, height);
}
size_t
Dimension::size() const
{
return m_elements.size();
}
size_t
Dimension::insert(size_t pos)
{
++m_next_pos;
const auto insert_pos = min(m_elements.size(), pos);
const auto it = m_elements.begin() + insert_pos;
m_elements.insert(it, m_next_pos);
return insert_pos;
}
void
Dimension::erase(size_t pos)
{
const auto it = m_elements.begin() + min(m_elements.size(), pos);
m_elements.erase(it);
}
size_t
Dimension::at(size_t pos) const
{
return m_elements.at(pos);
}
Dimension&
Table::rows()
{
return m_rows;
};
const Dimension&
Table::rows() const
{
return m_rows;
};
Dimension&
Table::cols()
{
return m_cols;
};
const Dimension&
Table::cols() const
{
return m_cols;
};
const Cell&
Table::at(size_t row, size_t col) const
{
const auto row_idx = m_rows.at(row);
const auto col_idx = m_cols.at(col);
const auto found = m_cells.find(make_pair(row_idx, col_idx));
if (found == m_cells.end()) {
static const Cell s_empty(Fill('.'));
return s_empty;
}
return found->second;
};
Cell&
Table::at(size_t row, size_t col)
{
const auto row_idx = m_rows.at(row);
const auto col_idx = m_cols.at(col);
return m_cells[make_pair(row_idx, col_idx)];
};
static
pair<size_t, size_t>
text_size(const string &s)
{
const auto s_split = split("\n", s);
size_t width = 0;
for (const auto &i : s_split) {
width = max(width, i.length());
}
return make_pair(width, s_split.size());
}
ostream& operator<<(ostream &os, const Table &table)
{
const auto rows = table.rows().size();
const auto cols = table.cols().size();
vector<size_t> row_heights(rows, 1);
vector<size_t> col_widths(cols, 1);
// Get the size of all fixed- and minimum-sized content cells
for (size_t row = 0; row < table.rows().size(); ++row) {
vector<string> col_text;
for (size_t col = 0; col < table.cols().size(); ++col) {
col_text.push_back(table.at(row, col).text(0, 0));
const auto tsize = text_size(*col_text.rbegin());
row_heights[row] = max(row_heights[row], tsize.second);
col_widths[col] = max(col_widths[col], tsize.first);
}
}
// Render the table
for (size_t row = 0; row < table.rows().size(); ++row) {
vector<string> lines(row_heights[row], "");
for (size_t col = 0; col < table.cols().size(); ++col) {
const auto& table_cell = table.at(row, col);
const auto table_text = table_cell.text(col_widths[col], row_heights[row]);
auto col_lines = split("\n", table_text);
col_lines.resize(row_heights[row], "");
for (size_t line = 0; line < row_heights[row]; ++line) {
if (col > 0) {
lines[line] += table.mid();
}
lines[line] += col_lines[line];
}
}
for (const auto &line : lines) {
os << table.left() << line << table.right() << "\n";
}
}
return os;
}
void
Table::left(const string &s)
{
m_left = s;
}
void
Table::mid(const string &s)
{
m_mid = s;
}
void
Table::right(const string &s)
{
m_right = s;
}
const string&
Table::left() const
{
return m_left;
}
const string&
Table::mid() const
{
return m_mid;
}
const string&
Table::right() const
{
return m_right;
}
}
}


@@ -76,13 +76,24 @@ namespace crucible {
/// Tasks to be executed after the current task is executed
list<TaskStatePtr> m_post_exec_queue;
/// Set by run() and append(). Cleared by exec().
/// Set by run(), append(), and insert(). Cleared by exec().
bool m_run_now = false;
/// Set by insert(). Cleared by exec() and destructor.
bool m_sort_queue = false;
/// Set when task starts execution by exec().
/// Cleared when exec() ends.
bool m_is_running = false;
/// Set when task is queued while already running.
/// Cleared when task is requeued.
bool m_run_again = false;
/// Set when task is queued as idle task while already running.
/// Cleared when task is queued as non-idle task.
bool m_idle = false;
/// Sequential identifier for next task
static atomic<TaskId> s_next_id;
@@ -107,7 +118,7 @@ namespace crucible {
static void clear_queue(TaskQueue &tq);
/// Rescue any TaskQueue, not just this one.
static void rescue_queue(TaskQueue &tq);
static void rescue_queue(TaskQueue &tq, const bool sort_queue);
TaskState &operator=(const TaskState &) = delete;
TaskState(const TaskState &) = delete;
@@ -124,6 +135,9 @@ namespace crucible {
/// instance at the end of TaskMaster's global queue.
void run();
/// Run the task when there are no more Tasks on the main queue.
void idle();
/// Execute task immediately in current thread if it is not already
/// executing in another thread; otherwise, append the current task
/// to itself to be executed immediately in the other thread.
@@ -139,6 +153,10 @@ namespace crucible {
/// or is destroyed.
void append(const TaskStatePtr &task);
/// Queue task to execute after current task finishes executing
/// or is destroyed, in task ID order.
void insert(const TaskStatePtr &task);
/// How many Tasks are there? Good for catching leaks
static size_t instance_count();
};
@@ -150,6 +168,7 @@ namespace crucible {
mutex m_mutex;
condition_variable m_condvar;
TaskQueue m_queue;
TaskQueue m_idle_queue;
size_t m_thread_max;
size_t m_thread_min = 0;
set<TaskConsumerPtr> m_threads;
@@ -184,6 +203,7 @@ namespace crucible {
TaskMasterState(size_t thread_max = thread::hardware_concurrency());
static void push_back(const TaskStatePtr &task);
static void push_back_idle(const TaskStatePtr &task);
static void push_front(TaskQueue &queue);
size_t get_queue_count();
size_t get_thread_count();
@@ -214,16 +234,21 @@ namespace crucible {
static auto s_tms = make_shared<TaskMasterState>();
void
TaskState::rescue_queue(TaskQueue &queue)
TaskState::rescue_queue(TaskQueue &queue, const bool sort_queue)
{
if (queue.empty()) {
return;
}
const auto tlcc = tl_current_consumer;
const auto &tlcc = tl_current_consumer;
if (tlcc) {
// We are executing under a TaskConsumer, splice our post-exec queue at front.
// No locks needed because we are using only thread-local objects.
tlcc->m_local_queue.splice(tlcc->m_local_queue.begin(), queue);
if (sort_queue) {
tlcc->m_local_queue.sort([&](const TaskStatePtr &a, const TaskStatePtr &b) {
return a->m_id < b->m_id;
});
}
} else {
// We are not executing under a TaskConsumer.
// If there is only one task, then just insert it at the front of the queue.
@@ -234,6 +259,8 @@ namespace crucible {
// then push it to the front of the global queue using normal locking methods.
TaskStatePtr rescue_task(make_shared<TaskState>("rescue_task", [](){}));
swap(rescue_task->m_post_exec_queue, queue);
// Do the sort--once--when a new Consumer has picked up the Task
rescue_task->m_sort_queue = sort_queue;
TaskQueue tq_one { rescue_task };
TaskMasterState::push_front(tq_one);
}
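As a standalone miniature of the rescue-queue pattern in the hunk above (names here are invented for illustration, not the crucible API): dependents are spliced onto the consumer's local queue in O(1), and the O(n log n) sort by task ID is paid only once, when insert() requested ordered execution.

```cpp
#include <cstdint>
#include <list>
#include <memory>

struct MiniTask { uint64_t id; };
using MiniTaskPtr = std::shared_ptr<MiniTask>;

void rescue(std::list<MiniTaskPtr> &local, std::list<MiniTaskPtr> &post_exec, bool sort_queue)
{
	// splice moves all nodes without copying and preserves relative order
	local.splice(local.begin(), post_exec);
	if (sort_queue) {
		// sort once, at the point a consumer picks up the tasks
		local.sort([](const MiniTaskPtr &a, const MiniTaskPtr &b) {
			return a->id < b->id;
		});
	}
}
```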
@@ -246,7 +273,8 @@ namespace crucible {
--s_instance_count;
unique_lock<mutex> lock(m_mutex);
// If any dependent Tasks were appended since the last exec, run them now
TaskState::rescue_queue(m_post_exec_queue);
TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
// No need to clear m_sort_queue here, it won't exist soon
}
TaskState::TaskState(string title, function<void()> exec_fn) :
@@ -305,6 +333,24 @@ namespace crucible {
task->m_run_now = true;
append_nolock(task);
}
task->m_idle = false;
}
void
TaskState::insert(const TaskStatePtr &task)
{
THROW_CHECK0(invalid_argument, task);
THROW_CHECK2(invalid_argument, m_id, task->m_id, m_id != task->m_id);
PairLock lock(m_mutex, task->m_mutex);
if (!task->m_run_now) {
task->m_run_now = true;
// Move the task and its post-exec queue to follow this task,
// and request a sort of the flattened list.
m_sort_queue = true;
m_post_exec_queue.push_back(task);
m_post_exec_queue.splice(m_post_exec_queue.end(), task->m_post_exec_queue);
}
task->m_idle = false;
}
void
@@ -315,7 +361,7 @@ namespace crucible {
unique_lock<mutex> lock(m_mutex);
if (m_is_running) {
append_nolock(shared_from_this());
m_run_again = true;
return;
} else {
m_run_now = false;
@@ -339,8 +385,20 @@ namespace crucible {
swap(this_task, tl_current_task);
m_is_running = false;
if (m_run_again) {
m_run_again = false;
if (m_idle) {
// All the way back to the end of the line
TaskMasterState::push_back_idle(shared_from_this());
} else {
// Insert after any dependents waiting for this Task
m_post_exec_queue.push_back(shared_from_this());
}
}
// Splice task post_exec queue at front of local queue
TaskState::rescue_queue(m_post_exec_queue);
TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
m_sort_queue = false;
}
string
@@ -360,11 +418,32 @@ namespace crucible {
TaskState::run()
{
unique_lock<mutex> lock(m_mutex);
m_idle = false;
if (m_run_now) {
return;
}
m_run_now = true;
TaskMasterState::push_back(shared_from_this());
if (m_is_running) {
m_run_again = true;
} else {
TaskMasterState::push_back(shared_from_this());
}
}
void
TaskState::idle()
{
unique_lock<mutex> lock(m_mutex);
m_idle = true;
if (m_run_now) {
return;
}
m_run_now = true;
if (m_is_running) {
m_run_again = true;
} else {
TaskMasterState::push_back_idle(shared_from_this());
}
}
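The consumer's pick order introduced by idle() can be sketched as follows (an illustrative stand-in, not the crucible API): tasks on the main queue always run before anything on the idle queue, so idle tasks execute only once the main queue has drained.

```cpp
#include <deque>
#include <optional>
#include <string>

std::optional<std::string> pick_next(std::deque<std::string> &main_q,
                                     std::deque<std::string> &idle_q)
{
	if (!main_q.empty()) {
		auto t = main_q.front();
		main_q.pop_front();
		return t;
	}
	// idle queue is consulted only when the main queue is empty
	if (!idle_q.empty()) {
		auto t = idle_q.front();
		idle_q.pop_front();
		return t;
	}
	return std::nullopt;  // caller would block on the condvar here
}
```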
TaskMasterState::TaskMasterState(size_t thread_max) :
@@ -410,6 +489,20 @@ namespace crucible {
s_tms->start_threads_nolock();
}
void
TaskMasterState::push_back_idle(const TaskStatePtr &task)
{
THROW_CHECK0(runtime_error, task);
unique_lock<mutex> lock(s_tms->m_mutex);
if (s_tms->m_cancelled) {
task->clear();
return;
}
s_tms->m_idle_queue.push_back(task);
s_tms->m_condvar.notify_all();
s_tms->start_threads_nolock();
}
void
TaskMasterState::push_front(TaskQueue &queue)
{
@@ -456,12 +549,26 @@ namespace crucible {
TaskMaster::print_queue(ostream &os)
{
unique_lock<mutex> lock(s_tms->m_mutex);
os << "Queue (size " << s_tms->m_queue.size() << "):" << endl;
auto queue_copy = s_tms->m_queue;
lock.unlock();
os << "Queue (size " << queue_copy.size() << "):" << endl;
size_t counter = 0;
for (auto i : s_tms->m_queue) {
for (auto i : queue_copy) {
os << "Queue #" << ++counter << " Task ID " << i->id() << " " << i->title() << endl;
}
return os << "Queue End" << endl;
os << "Queue End" << endl;
lock.lock();
queue_copy = s_tms->m_idle_queue;
lock.unlock();
os << "Idle (size " << queue_copy.size() << "):" << endl;
counter = 0;
for (const auto &i : queue_copy) {
os << "Idle #" << ++counter << " Task ID " << i->id() << " " << i->title() << endl;
}
os << "Idle End" << endl;
return os;
}
ostream &
@@ -486,11 +593,6 @@ namespace crucible {
size_t
TaskMasterState::calculate_thread_count_nolock()
{
if (m_paused) {
// No threads running while paused or cancelled
return 0;
}
if (m_load_target == 0) {
// No limits, no stats, use configured thread count
return m_configured_thread_max;
@@ -583,6 +685,7 @@ namespace crucible {
m_cancelled = true;
decltype(m_queue) empty_queue;
m_queue.swap(empty_queue);
empty_queue.splice(empty_queue.end(), m_idle_queue);
m_condvar.notify_all();
lock.unlock();
TaskState::clear_queue(empty_queue);
@@ -600,6 +703,9 @@ namespace crucible {
unique_lock<mutex> lock(m_mutex);
m_paused = paused;
m_condvar.notify_all();
if (!m_paused) {
start_threads_nolock();
}
lock.unlock();
}
@@ -682,6 +788,13 @@ namespace crucible {
m_task_state->run();
}
void
Task::idle() const
{
THROW_CHECK0(runtime_error, m_task_state);
m_task_state->idle();
}
void
Task::append(const Task &that) const
{
@@ -690,6 +803,14 @@ namespace crucible {
m_task_state->append(that.m_task_state);
}
void
Task::insert(const Task &that) const
{
THROW_CHECK0(runtime_error, m_task_state);
THROW_CHECK0(runtime_error, that);
m_task_state->insert(that.m_task_state);
}
Task
Task::current_task()
{
@@ -772,6 +893,9 @@ namespace crucible {
} else if (!master_copy->m_queue.empty()) {
m_current_task = *master_copy->m_queue.begin();
master_copy->m_queue.pop_front();
} else if (!master_copy->m_idle_queue.empty()) {
m_current_task = *master_copy->m_idle_queue.begin();
master_copy->m_idle_queue.pop_front();
} else {
master_copy->m_condvar.wait(lock);
continue;
@@ -801,11 +925,13 @@ namespace crucible {
swap(this_consumer, tl_current_consumer);
assert(!tl_current_consumer);
// Release lock to rescue queue (may attempt to queue a new task at TaskMaster).
// rescue_queue normally sends tasks to the local queue of the current TaskConsumer thread,
// but we just disconnected ourselves from that.
// Release lock to rescue queue (may attempt to queue a
// new task at TaskMaster). rescue_queue normally sends
// tasks to the local queue of the current TaskConsumer
// thread, but we just disconnected ourselves from that.
// No sorting here because this is not a TaskState.
lock.unlock();
TaskState::rescue_queue(m_local_queue);
TaskState::rescue_queue(m_local_queue, false);
// Hold lock so we can erase ourselves
lock.lock();
@@ -883,21 +1009,6 @@ namespace crucible {
m_owner.reset();
}
void
Exclusion::insert_task(const Task &task)
{
unique_lock<mutex> lock(m_mutex);
const auto sp = m_owner.lock();
lock.unlock();
if (sp) {
// If Exclusion is locked then queue task for release;
sp->append(task);
} else {
// otherwise, run the inserted task immediately
task.run();
}
}
ExclusionLock
Exclusion::try_lock(const Task &task)
{
@@ -905,7 +1016,7 @@ namespace crucible {
const auto sp = m_owner.lock();
if (sp) {
if (task) {
sp->append(task);
sp->insert(task);
}
return ExclusionLock();
} else {


@@ -98,12 +98,16 @@ namespace crucible {
m_rate(rate),
m_burst(burst)
{
THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
}
RateLimiter::RateLimiter(double rate) :
m_rate(rate),
m_burst(rate)
{
THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
}
void
@@ -119,6 +123,7 @@ namespace crucible {
double
RateLimiter::sleep_time(double cost)
{
THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
borrow(cost);
unique_lock<mutex> lock(m_mutex);
update_tokens();
@@ -154,6 +159,21 @@ namespace crucible {
m_tokens -= cost;
}
void
RateLimiter::rate(double const new_rate)
{
THROW_CHECK1(invalid_argument, new_rate, new_rate > 0);
unique_lock<mutex> lock(m_mutex);
m_rate = new_rate;
}
double
RateLimiter::rate() const
{
unique_lock<mutex> lock(m_mutex);
return m_rate;
}
RateEstimator::RateEstimator(double min_delay, double max_delay) :
m_min_delay(min_delay),
m_max_delay(max_delay)
@@ -202,6 +222,13 @@ namespace crucible {
}
}
void
RateEstimator::increment(const uint64_t more)
{
unique_lock<mutex> lock(m_mutex);
return update_unlocked(m_last_count + more);
}
uint64_t
RateEstimator::count() const
{
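The token-bucket arithmetic behind RateLimiter can be sketched as below (simplified: time is injected explicitly instead of read from a clock, and there is no locking; TokenBucket is an invented name). Tokens accrue at `rate` per second up to `burst`; sleep_time() borrows `cost` tokens and reports how long the caller must wait to return to a non-negative balance.

```cpp
#include <algorithm>
#include <cassert>

struct TokenBucket {
	double rate;    // tokens per second; must be > 0, as the THROW_CHECK1 guards enforce
	double burst;   // maximum stored tokens; must be >= 0
	double tokens = 0;
	double last = 0;

	TokenBucket(double r, double b) : rate(r), burst(b) {
		assert(rate > 0 && burst >= 0);
	}

	// advance the clock to `now` and accrue tokens, clamped to burst
	void update(double now) {
		tokens = std::min(burst, tokens + (now - last) * rate);
		last = now;
	}

	// borrow `cost` tokens; a negative balance maps to a sleep duration
	double sleep_time(double now, double cost) {
		update(now);
		tokens -= cost;
		return tokens >= 0 ? 0.0 : -tokens / rate;
	}
};
```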


@@ -1,5 +1,13 @@
#!/bin/bash
# if not called from systemd try to replicate mount unsharing on ctrl+c
# see: https://github.com/Zygo/bees/issues/281
if [ -z "${SYSTEMD_EXEC_PID}" -a -z "${UNSHARE_DONE}" ]; then
UNSHARE_DONE=true
export UNSHARE_DONE
exec unshare -m --propagation private -- "$0" "$@"
fi
## Helpful functions
INFO(){ echo "INFO:" "$@"; }
ERRO(){ echo "ERROR:" "$@"; exit 1; }
@@ -108,13 +116,11 @@ mkdir -p "$WORK_DIR" || exit 1
INFO "MOUNT DIR: $MNT_DIR"
mkdir -p "$MNT_DIR" || exit 1
mount --make-private -osubvolid=5 /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
mount --make-private -osubvolid=5,nodev,noexec /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
if [ ! -d "$BEESHOME" ]; then
INFO "Create subvol $BEESHOME to store bees data"
btrfs sub cre "$BEESHOME"
else
btrfs sub show "$BEESHOME" &> /dev/null || ERRO "$BEESHOME MUST BE A SUBVOL!"
fi
# Check DB size


@@ -17,6 +17,7 @@ KillSignal=SIGTERM
MemoryAccounting=true
Nice=19
Restart=on-abnormal
RuntimeDirectoryMode=0700
RuntimeDirectory=bees
StartupCPUWeight=25
StartupIOWeight=25


@@ -20,7 +20,6 @@
using namespace crucible;
using namespace std;
BeesFdCache::BeesFdCache(shared_ptr<BeesContext> ctx) :
m_ctx(ctx)
{
@@ -98,6 +97,9 @@ BeesContext::dump_status()
TaskMaster::print_queue(ofs);
#endif
ofs << "PROGRESS:\n";
ofs << get_progress();
ofs.close();
BEESNOTE("renaming status file '" << status_file << "'");
@@ -112,6 +114,23 @@ BeesContext::dump_status()
}
}
void
BeesContext::set_progress(const string &str)
{
unique_lock<mutex> lock(m_progress_mtx);
m_progress_str = str;
}
string
BeesContext::get_progress()
{
unique_lock<mutex> lock(m_progress_mtx);
if (m_progress_str.empty()) {
return "[No progress estimate available]\n";
}
return m_progress_str;
}
void
BeesContext::show_progress()
{
@@ -159,6 +178,8 @@ BeesContext::show_progress()
BEESLOGINFO("\ttid " << t.first << ": " << t.second);
}
// No need to log progress here, it is logged when set
lastStats = thisStats;
}
}
@@ -182,7 +203,7 @@ BeesContext::home_fd()
}
bool
BeesContext::is_root_ro(uint64_t root)
BeesContext::is_root_ro(uint64_t const root)
{
return roots()->is_root_ro(root);
}
@@ -192,6 +213,7 @@ BeesContext::dedup(const BeesRangePair &brp_in)
{
// TOOLONG and NOTE can retroactively fill in the filename details, but LOG can't
BEESNOTE("dedup " << brp_in);
BEESTRACE("dedup " << brp_in);
if (is_root_ro(brp_in.second.fid().root())) {
// BEESLOGDEBUG("WORKAROUND: dst root " << (brp_in.second.fid().root()) << " is read-only");
@@ -208,8 +230,10 @@ BeesContext::dedup(const BeesRangePair &brp_in)
BeesAddress first_addr(brp.first.fd(), brp.first.begin());
BeesAddress second_addr(brp.second.fd(), brp.second.begin());
if (first_addr.get_physical_or_zero() == second_addr.get_physical_or_zero()) {
BEESLOGTRACE("equal physical addresses in dedup");
const auto first_gpoz = first_addr.get_physical_or_zero();
const auto second_gpoz = second_addr.get_physical_or_zero();
if (first_gpoz == second_gpoz) {
BEESLOGDEBUG("equal physical addresses " << first_addr << " and " << second_addr << " in dedup");
BEESCOUNT(bug_dedup_same_physical);
}
@@ -219,27 +243,40 @@ BeesContext::dedup(const BeesRangePair &brp_in)
BEESCOUNT(dedup_try);
BEESNOTE("waiting to dedup " << brp);
const auto lock = MultiLocker::get_lock("dedupe");
Timer dedup_timer;
auto lock = MultiLocker::get_lock("dedupe");
BEESLOGINFO("dedup: src " << pretty(brp.first.size()) << " [" << to_hex(brp.first.begin()) << ".." << to_hex(brp.first.end()) << "] {" << first_addr << "} " << name_fd(brp.first.fd()) << "\n"
<< " dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
BEESNOTE("dedup: src " << pretty(brp.first.size()) << " [" << to_hex(brp.first.begin()) << ".." << to_hex(brp.first.end()) << "] {" << first_addr << "} " << name_fd(brp.first.fd()) << "\n"
<< " dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
const bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
BEESCOUNTADD(dedup_ms, dedup_timer.age() * 1000);
while (true) {
try {
Timer dedup_timer;
const bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
BEESCOUNTADD(dedup_ms, dedup_timer.age() * 1000);
if (rv) {
BEESCOUNT(dedup_hit);
BEESCOUNTADD(dedup_bytes, brp.first.size());
} else {
BEESCOUNT(dedup_miss);
BEESLOGWARN("NO Dedup! " << brp);
if (rv) {
BEESCOUNT(dedup_hit);
BEESCOUNTADD(dedup_bytes, brp.first.size());
} else {
BEESCOUNT(dedup_miss);
BEESLOGINFO("NO Dedup! " << brp);
}
lock.reset();
bees_throttle(dedup_timer.age(), "dedup");
return rv;
} catch (const std::system_error &e) {
if (e.code().value() == EAGAIN) {
BEESNOTE("dedup waiting for btrfs send on " << brp.second);
BEESLOGDEBUG("dedup waiting for btrfs send on " << brp.second);
roots()->wait_for_transid(1);
} else {
throw;
}
}
}
return rv;
}
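The retry loop added to dedup() above follows a reusable shape: the operation is retried only when it fails with EAGAIN (the log message shows btrfs returns this while a send is running on the destination), and every other error propagates. A hedged standalone sketch, where `wait` stands in for roots()->wait_for_transid(1) and both names are illustrative:

```cpp
#include <cerrno>
#include <functional>
#include <system_error>

bool retry_on_eagain(const std::function<bool()> &op, const std::function<void()> &wait)
{
	while (true) {
		try {
			return op();
		} catch (const std::system_error &e) {
			if (e.code().value() != EAGAIN) {
				throw;  // not a transient conflict, propagate
			}
			wait();  // block until it is safe to retry
		}
	}
}
```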
BeesRangePair
@@ -264,6 +301,7 @@ BeesContext::rewrite_file_range(const BeesFileRange &bfr)
// BEESLOG("BeesResolver br(..., " << bfr << ")");
BEESTRACE("BeesContext::rewrite_file_range calling BeesResolver " << bfr);
BeesResolver br(m_ctx, BeesAddress(bfr.fd(), bfr.begin()));
BEESTRACE("BeesContext::rewrite_file_range calling replace_src " << dup_bbd);
// BEESLOG("\treplace_src " << dup_bbd);
br.replace_src(dup_bbd);
BEESCOUNT(scan_rewrite);
@@ -291,23 +329,38 @@ BeesContext::rewrite_file_range(const BeesFileRange &bfr)
}
}
BeesFileRange
struct BeesSeenRange {
uint64_t bytenr;
off_t offset;
off_t length;
};
static
bool
operator<(const BeesSeenRange &bsr1, const BeesSeenRange &bsr2)
{
return tie(bsr1.bytenr, bsr1.offset, bsr1.length) < tie(bsr2.bytenr, bsr2.offset, bsr2.length);
}
static
__attribute__((unused))
ostream&
operator<<(ostream &os, const BeesSeenRange &tup)
{
return os << "BeesSeenRange { " << to_hex(tup.bytenr) << ", " << to_hex(tup.offset) << "+" << pretty(tup.length) << " }";
}
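The operator< above compares three fields lexicographically via std::tie, which builds a tuple of references and reuses tuple's ordering: bytenr decides first, then offset, then length. A minimal standalone copy of the same idiom (Range is an illustrative name):

```cpp
#include <tuple>

struct Range {
	unsigned long bytenr;
	long offset;
	long length;
};

// lexicographic: bytenr, then offset, then length
bool operator<(const Range &a, const Range &b)
{
	return std::tie(a.bytenr, a.offset, a.length) < std::tie(b.bytenr, b.offset, b.length);
}
```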
void
BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
{
BEESNOTE("Scanning " << pretty(e.size()) << " "
<< to_hex(e.begin()) << ".." << to_hex(e.end())
<< " " << name_fd(bfr.fd()) );
BEESTRACE("scan extent " << e);
BEESTRACE("scan bfr " << bfr);
BEESCOUNT(scan_extent);
// EXPERIMENT: Don't bother with tiny extents unless they are the entire file.
// We'll take a tiny extent at BOF or EOF but not in between.
if (e.begin() && e.size() < 128 * 1024 && e.end() != Stat(bfr.fd()).st_size) {
BEESCOUNT(scan_extent_tiny);
// This doesn't work properly with the current architecture,
// so we don't do an early return here.
// return bfr;
}
Timer one_timer;
// We keep moving this method around
auto m_ctx = shared_from_this();
@@ -322,19 +375,19 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
Extent::OBSCURED | Extent::PREALLOC
)) {
BEESCOUNT(scan_interesting);
BEESLOGWARN("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
BEESLOGINFO("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
}
if (e.flags() & Extent::HOLE) {
// Nothing here, dispose of this early
BEESCOUNT(scan_hole);
return bfr;
return;
}
if (e.flags() & Extent::PREALLOC) {
// Prealloc is all zero and we replace it with a hole.
// No special handling is required here. Nuke it and move on.
BEESLOGINFO("prealloc extent " << e);
BEESLOGINFO("prealloc extent " << e << " in " << bfr);
// Must not extend past EOF
auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
// Must hold tmpfile until dedupe is done
@@ -347,38 +400,57 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
if (m_ctx->dedup(brp)) {
BEESCOUNT(dedup_prealloc_hit);
BEESCOUNTADD(dedup_prealloc_bytes, e.size());
return bfr;
return;
} else {
BEESCOUNT(dedup_prealloc_miss);
}
}
// If we already read this extent and inserted it into the hash table, no need to read it again
static mutex s_seen_mutex;
unique_lock<mutex> lock_seen(s_seen_mutex);
const BeesSeenRange tup = {
.bytenr = e.bytenr(),
.offset = e.offset(),
.length = e.size(),
};
static set<BeesSeenRange> s_seen;
if (s_seen.size() > BEES_MAX_EXTENT_REF_COUNT) {
s_seen.clear();
BEESCOUNT(scan_seen_clear);
}
const auto seen_rv = s_seen.find(tup) != s_seen.end();
if (!seen_rv) {
BEESCOUNT(scan_seen_miss);
} else {
// BEESLOGDEBUG("Skip " << tup << " " << e);
BEESCOUNT(scan_seen_hit);
return;
}
lock_seen.unlock();
// OK we need to read extent now
bees_readahead(bfr.fd(), bfr.begin(), bfr.size());
map<off_t, pair<BeesHash, BeesAddress>> insert_map;
set<off_t> noinsert_set;
// Hole handling
bool extent_compressed = e.flags() & FIEMAP_EXTENT_ENCODED;
bool extent_contains_zero = false;
bool extent_contains_nonzero = false;
// Need to replace extent
bool rewrite_extent = false;
set<off_t> dedupe_set;
set<off_t> zero_set;
// Pretty graphs
off_t block_count = ((e.size() + BLOCK_MASK_SUMS) & ~BLOCK_MASK_SUMS) / BLOCK_SIZE_SUMS;
BEESTRACE(e << " block_count " << block_count);
string bar(block_count, '#');
for (off_t next_p = e.begin(); next_p < e.end(); ) {
// List of dedupes found
list<BeesRangePair> dedupe_list;
list<BeesFileRange> copy_list;
list<pair<BeesHash, BeesAddress>> front_hash_list;
list<uint64_t> invalidate_addr_list;
// Guarantee forward progress
off_t p = next_p;
next_p += BLOCK_SIZE_SUMS;
off_t next_p = e.begin();
for (off_t p = e.begin(); p < e.end(); p += BLOCK_SIZE_SUMS) {
off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
const off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
BeesAddress addr(e, p);
// This extent should consist entirely of non-magic blocks
@@ -393,69 +465,68 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
// Calculate the hash first because it lets us shortcut on is_data_zero
BEESNOTE("scan hash " << bbd);
BeesHash hash = bbd.hash();
const BeesHash hash = bbd.hash();
// Weed out zero blocks
BEESNOTE("is_data_zero " << bbd);
const bool data_is_zero = bbd.is_data_zero();
if (data_is_zero) {
bar.at(bar_p) = '0';
zero_set.insert(p);
BEESCOUNT(scan_zero);
continue;
}
// Schedule this block for insertion if we decide to keep this extent.
BEESCOUNT(scan_hash_preinsert);
BEESTRACE("Pushing hash " << hash << " addr " << addr << " bbd " << bbd);
insert_map.insert(make_pair(p, make_pair(hash, addr)));
bar.at(bar_p) = 'R';
bar.at(bar_p) = 'i';
// Weed out zero blocks
BEESNOTE("is_data_zero " << bbd);
bool extent_is_zero = bbd.is_data_zero();
if (extent_is_zero) {
bar.at(bar_p) = '0';
if (extent_compressed) {
if (!extent_contains_zero) {
// BEESLOG("compressed zero bbd " << bbd << "\n\tin extent " << e);
}
extent_contains_zero = true;
// Do not attempt to lookup hash of zero block
continue;
} else {
BEESLOGINFO("zero bbd " << bbd << "\n\tin extent " << e);
BEESCOUNT(scan_zero_uncompressed);
rewrite_extent = true;
break;
}
} else {
if (extent_contains_zero && !extent_contains_nonzero) {
// BEESLOG("compressed nonzero bbd " << bbd << "\n\tin extent " << e);
}
extent_contains_nonzero = true;
}
// Ensure we fill in the entire insert_map without skipping any non-zero blocks
if (p < next_p) continue;
BEESNOTE("lookup hash " << bbd);
auto found = hash_table->find_cell(hash);
const auto found = hash_table->find_cell(hash);
BEESCOUNT(scan_lookup);
set<BeesResolver> resolved_addrs;
set<BeesAddress> found_addrs;
list<BeesAddress> ordered_addrs;
// We know that there is at least one copy of the data and where it is,
// but we don't want to do expensive LOGICAL_INO operations unless there
// are at least two distinct addresses to look at.
found_addrs.insert(addr);
for (auto i : found) {
for (const auto &i : found) {
BEESTRACE("found (hash, address): " << i);
BEESCOUNT(scan_found);
// Hash has to match
THROW_CHECK2(runtime_error, i.e_hash, hash, i.e_hash == hash);
// We know that there is at least one copy of the data and where it is.
// Filter out anything that can't possibly match before we pull out the
// LOGICAL_INO hammer.
BeesAddress found_addr(i.e_addr);
#if 0
// If address already in hash table, move on to next extent.
// We've already seen this block and may have made additional references to it.
// The current extent is effectively "pinned" and can't be modified any more.
// Only extents that are scanned but not modified are inserted, so if there's
// a matching hash:address pair in the hash table:
// 1. We have already scanned this extent.
// 2. We may have already created references to this extent.
// 3. We won't scan this extent again.
// The current extent is effectively "pinned" and can't be modified
// without rescanning all the existing references.
if (found_addr.get_physical_or_zero() == addr.get_physical_or_zero()) {
// No log message because this happens to many thousands of blocks
// when bees is interrupted.
// BEESLOGDEBUG("Found matching hash " << hash << " at same address " << addr << ", skipping " << bfr);
BEESCOUNT(scan_already);
return bfr;
return;
}
// Address is a duplicate.
// Check this early so we don't have duplicate counts.
if (!found_addrs.insert(found_addr).second) {
BEESCOUNT(scan_twice);
continue;
}
#endif
// Block must have matching EOF alignment
if (found_addr.is_unaligned_eof() != addr.is_unaligned_eof()) {
@@ -463,214 +534,353 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
continue;
}
// Address is a duplicate
if (!found_addrs.insert(found_addr).second) {
BEESCOUNT(scan_twice);
continue;
}
// Hash is toxic
if (found_addr.is_toxic()) {
BEESLOGWARN("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
BEESLOGDEBUG("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
// Don't push these back in because we'll never delete them.
// Extents may become non-toxic so give them a chance to expire.
// hash_table->push_front_hash_addr(hash, found_addr);
BEESCOUNT(scan_toxic_hash);
return bfr;
return;
}
// Distinct address, go resolve it
bool abandon_extent = false;
catch_all([&]() {
BEESNOTE("resolving " << found_addr << " matched " << bbd);
BEESTRACE("resolving " << found_addr << " matched " << bbd);
BEESTRACE("BeesContext::scan_one_extent calling BeesResolver " << found_addr);
BeesResolver resolved(m_ctx, found_addr);
// Toxic extents are really toxic
if (resolved.is_toxic()) {
BEESLOGWARN("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
BEESCOUNT(scan_toxic_match);
// Make sure we never see this hash again.
// It has become toxic since it was inserted into the hash table.
found_addr.set_toxic();
hash_table->push_front_hash_addr(hash, found_addr);
abandon_extent = true;
} else if (!resolved.count()) {
BEESCOUNT(scan_resolve_zero);
// Didn't find anything, address is dead
BEESTRACE("matched hash " << hash << " addr " << addr << " count zero");
hash_table->erase_hash_addr(hash, found_addr);
} else {
resolved_addrs.insert(resolved);
BEESCOUNT(scan_resolve_hit);
}
});
// Put this address in the list without changing hash table order
ordered_addrs.push_back(found_addr);
}
if (abandon_extent) {
return bfr;
// Cheap filtering is now out of the way, now for some heavy lifting
for (auto found_addr : ordered_addrs) {
// Hash table says there's a matching block on the filesystem.
// Go find refs to it.
BEESNOTE("resolving " << found_addr << " matched " << bbd);
BEESTRACE("resolving " << found_addr << " matched " << bbd);
BEESTRACE("BeesContext::scan_one_extent calling BeesResolver " << found_addr);
BeesResolver resolved(m_ctx, found_addr);
// Toxic extents are really toxic
if (resolved.is_toxic()) {
BEESLOGDEBUG("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
BEESCOUNT(scan_toxic_match);
// Make sure we never see this hash again.
// It has become toxic since it was inserted into the hash table.
found_addr.set_toxic();
hash_table->push_front_hash_addr(hash, found_addr);
return;
} else if (!resolved.count()) {
BEESCOUNT(scan_resolve_zero);
// Didn't find a block at the table address, address is dead
BEESLOGDEBUG("Erasing stale addr " << addr << " hash " << hash);
hash_table->erase_hash_addr(hash, found_addr);
continue;
} else {
BEESCOUNT(scan_resolve_hit);
}
}
// This shouldn't happen (often), so let's count it separately
if (resolved_addrs.size() > 2) {
BEESCOUNT(matched_3_or_more);
}
if (resolved_addrs.size() > 1) {
BEESCOUNT(matched_2_or_more);
}
// No need to do all this unless there are two or more distinct matches
if (!resolved_addrs.empty()) {
// `resolved` contains references to a block on the filesystem that still exists.
bar.at(bar_p) = 'M';
BEESCOUNT(matched_1_or_more);
BEESTRACE("resolved_addrs.size() = " << resolved_addrs.size());
BEESNOTE("resolving " << resolved_addrs.size() << " matches for hash " << hash);
BeesFileRange replaced_bfr;
BEESNOTE("finding one match (out of " << resolved.count() << ") at " << resolved.addr() << " for " << bbd);
BEESTRACE("finding one match (out of " << resolved.count() << ") at " << resolved.addr() << " for " << bbd);
auto replaced_brp = resolved.replace_dst(bbd);
BeesFileRange &replaced_bfr = replaced_brp.second;
BEESTRACE("next_p " << to_hex(next_p) << " -> replaced_bfr " << replaced_bfr);
BeesAddress last_replaced_addr;
for (auto it = resolved_addrs.begin(); it != resolved_addrs.end(); ++it) {
// FIXME: Need to terminate this loop on replace_dst exception condition
// catch_all([&]() {
auto it_copy = *it;
BEESNOTE("finding one match (out of " << it_copy.count() << ") at " << it_copy.addr() << " for " << bbd);
BEESTRACE("finding one match (out of " << it_copy.count() << ") at " << it_copy.addr() << " for " << bbd);
replaced_bfr = it_copy.replace_dst(bbd);
BEESTRACE("next_p " << to_hex(next_p) << " -> replaced_bfr " << replaced_bfr);
// If we didn't find this hash where the hash table said it would be,
// correct the hash table.
if (it_copy.found_hash()) {
BEESCOUNT(scan_hash_hit);
} else {
// BEESLOGDEBUG("erase src hash " << hash << " addr " << it_copy.addr());
BEESCOUNT(scan_hash_miss);
hash_table->erase_hash_addr(hash, it_copy.addr());
}
if (it_copy.found_dup()) {
BEESCOUNT(scan_dup_hit);
// FIXME: we will thrash if we let multiple references to identical blocks
// exist in the hash table. Erase all but the last one.
if (last_replaced_addr) {
BEESLOGINFO("Erasing redundant hash " << hash << " addr " << last_replaced_addr);
hash_table->erase_hash_addr(hash, last_replaced_addr);
BEESCOUNT(scan_erase_redundant);
}
last_replaced_addr = it_copy.addr();
// Invalidate resolve cache so we can count refs correctly
m_ctx->invalidate_addr(it_copy.addr());
m_ctx->invalidate_addr(bbd.addr());
// Remove deduped blocks from insert map
THROW_CHECK0(runtime_error, replaced_bfr);
for (off_t ip = replaced_bfr.begin(); ip < replaced_bfr.end(); ip += BLOCK_SIZE_SUMS) {
BEESCOUNT(scan_dup_block);
noinsert_set.insert(ip);
if (ip >= e.begin() && ip < e.end()) {
off_t bar_p = (ip - e.begin()) / BLOCK_SIZE_SUMS;
bar.at(bar_p) = 'd';
}
}
// next_p may be past EOF so check p only
THROW_CHECK2(runtime_error, p, replaced_bfr, p < replaced_bfr.end());
BEESCOUNT(scan_bump);
next_p = replaced_bfr.end();
} else {
BEESCOUNT(scan_dup_miss);
}
// });
// If we did find a block, but not this hash, correct the hash table and move on
if (resolved.found_hash()) {
BEESCOUNT(scan_hash_hit);
} else {
BEESLOGDEBUG("Erasing stale hash " << hash << " addr " << resolved.addr());
hash_table->erase_hash_addr(hash, resolved.addr());
BEESCOUNT(scan_hash_miss);
continue;
}
if (last_replaced_addr) {
// If we replaced extents containing the incoming addr,
// push the addr we kept to the front of the hash LRU.
hash_table->push_front_hash_addr(hash, last_replaced_addr);
BEESCOUNT(scan_push_front);
// We found a block and it was a duplicate
if (resolved.found_dup()) {
THROW_CHECK0(runtime_error, replaced_bfr);
BEESCOUNT(scan_dup_hit);
// Save this match. If a better match is found later,
// it will be replaced.
dedupe_list.push_back(replaced_brp);
// Push matching block to front of LRU
front_hash_list.push_back(make_pair(hash, resolved.addr()));
// This is the block that matched in the replaced bfr
bar.at(bar_p) = '=';
// Invalidate resolve cache so we can count refs correctly
invalidate_addr_list.push_back(resolved.addr());
invalidate_addr_list.push_back(bbd.addr());
// next_p may be past EOF so check p only
THROW_CHECK2(runtime_error, p, replaced_bfr, p < replaced_bfr.end());
// We may find duplicate ranges of various lengths, so make sure
// we don't pick a smaller one
next_p = max(next_p, replaced_bfr.end());
// Stop after one dedupe is found. If there's a longer matching range
// out there, we'll find a matching block after the end of this range,
// since the longer range is longer than this one.
break;
} else {
BEESCOUNT(scan_dup_miss);
}
} else {
BEESCOUNT(matched_0);
}
}
// If the extent was compressed and all zeros, nuke entire thing
if (!rewrite_extent && (extent_contains_zero && !extent_contains_nonzero)) {
rewrite_extent = true;
BEESCOUNT(scan_zero_compressed);
bool force_insert = false;
// We don't want to punch holes into compressed extents, unless:
// 1. There was dedupe of non-zero blocks, so we always have to copy the rest of the extent
// 2. The entire extent is zero and the whole thing can be replaced with a single hole
const bool extent_compressed = e.flags() & FIEMAP_EXTENT_ENCODED;
if (extent_compressed && dedupe_list.empty() && !insert_map.empty()) {
// BEESLOGDEBUG("Compressed extent with non-zero data and no dedupe, skipping");
BEESCOUNT(scan_compressed_no_dedup);
force_insert = true;
}
// If we deduped any blocks then we must rewrite the remainder of the extent
if (!noinsert_set.empty()) {
rewrite_extent = true;
// FIXME: dedupe_list contains a lot of overlapping matches. Get rid of all but one.
list<BeesRangePair> dedupe_list_out;
dedupe_list.sort([](const BeesRangePair &a, const BeesRangePair &b) {
return b.second.size() < a.second.size();
});
// Shorten each dedupe brp by removing any overlap with earlier (longer) extents in list
for (auto i : dedupe_list) {
bool insert_i = true;
BEESTRACE("i = " << i << " insert_i " << insert_i);
for (const auto &j : dedupe_list_out) {
BEESTRACE("j = " << j);
// No overlap, try next one
if (j.second.end() <= i.second.begin() || j.second.begin() >= i.second.end()) {
continue;
}
// j fully overlaps or is the same as i, drop i
if (j.second.begin() <= i.second.begin() && j.second.end() >= i.second.end()) {
insert_i = false;
break;
}
// i begins outside j, i ends inside j, remove the end of i
if (i.second.end() > j.second.begin() && i.second.begin() <= j.second.begin()) {
const auto delta = i.second.end() - j.second.begin();
if (delta == i.second.size()) {
insert_i = false;
break;
}
i.shrink_end(delta);
continue;
}
// i begins inside j, ends outside j, remove the begin of i
if (i.second.begin() < j.second.end() && i.second.end() >= j.second.end()) {
const auto delta = j.second.end() - i.second.begin();
if (delta == i.second.size()) {
insert_i = false;
break;
}
i.shrink_begin(delta);
continue;
}
// i fully overlaps j, split i into two parts, push the other part onto dedupe_list
if (j.second.begin() > i.second.begin() && j.second.end() < i.second.end()) {
auto other_i = i;
const auto end_left_delta = i.second.end() - j.second.begin();
const auto begin_right_delta = j.second.end() - i.second.begin();
i.shrink_end(end_left_delta);
other_i.shrink_begin(begin_right_delta);
dedupe_list.push_back(other_i);
continue;
}
// None of the above. Oops!
THROW_CHECK0(runtime_error, false);
}
if (insert_i) {
dedupe_list_out.push_back(i);
}
}
dedupe_list = dedupe_list_out;
dedupe_list_out.clear();
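The overlap-trimming pass above has four cases (full cover, tail overlap, head overlap, split), which can be hard to follow inline. Here is a self-contained sketch of the same longest-first algorithm, with an illustrative `Range` type standing in for the `BeesRangePair` second range; names and types are assumptions, not the bees API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <list>

// Hypothetical stand-in for the BeesRangePair "second" range: a half-open
// byte range with the same shrink operations as BeesFileRange.
struct Range {
	int64_t begin = 0;
	int64_t end = 0;
	int64_t size() const { return end - begin; }
	void shrink_begin(int64_t d) { begin += d; }
	void shrink_end(int64_t d) { end -= d; }
};

// Mirror of the overlap-trimming pass: sort longest-first, then trim, drop,
// or split each shorter range so the surviving ranges don't overlap.
std::list<Range> trim_overlaps(std::list<Range> in)
{
	in.sort([](const Range &a, const Range &b) { return b.size() < a.size(); });
	std::list<Range> out;
	// std::list iterators stay valid across push_back, so re-queued split
	// pieces are visited by this same loop.
	for (auto it = in.begin(); it != in.end(); ++it) {
		Range r = *it;
		bool keep = true;
		for (const auto &j : out) {
			if (j.end <= r.begin || j.begin >= r.end) continue;                // no overlap
			if (j.begin <= r.begin && j.end >= r.end) { keep = false; break; } // fully covered
			if (r.end > j.begin && r.begin <= j.begin) {                       // tail of r overlaps j
				const auto d = r.end - j.begin;
				if (d == r.size()) { keep = false; break; }
				r.shrink_end(d);
				continue;
			}
			if (r.begin < j.end && r.end >= j.end) {                           // head of r overlaps j
				const auto d = j.end - r.begin;
				if (d == r.size()) { keep = false; break; }
				r.shrink_begin(d);
				continue;
			}
			// j strictly inside r: keep the left piece, re-queue the right piece
			Range right = r;
			r.shrink_end(r.end - j.begin);
			right.shrink_begin(j.end - right.begin);
			in.push_back(right);
		}
		if (keep)
			out.push_back(r);
	}
	return out;
}
```

Because the list is sorted longest-first, the split case can only be reached with equal-length ranges, so in practice it is defensive code, matching the `THROW_CHECK0` fallthrough above.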
// Count total dedupes
uint64_t bytes_deduped = 0;
for (const auto &i : dedupe_list) {
// Remove deduped blocks from insert map and zero map
for (off_t ip = i.second.begin(); ip < i.second.end(); ip += BLOCK_SIZE_SUMS) {
BEESCOUNT(scan_dup_block);
dedupe_set.insert(ip);
zero_set.erase(ip);
}
bytes_deduped += i.second.size();
}
// If we need to replace part of the extent, rewrite all instances of it
if (rewrite_extent) {
bool blocks_rewritten = false;
// Copy all blocks of the extent that were not deduped or zero, but don't copy an entire extent
uint64_t bytes_zeroed = 0;
if (!force_insert) {
BEESTRACE("Rewriting extent " << e);
off_t last_p = e.begin();
off_t p = last_p;
off_t next_p = last_p;
BEESTRACE("next_p " << to_hex(next_p) << " p " << to_hex(p) << " last_p " << to_hex(last_p));
for (next_p = e.begin(); next_p < e.end(); ) {
p = next_p;
next_p = min(next_p + BLOCK_SIZE_SUMS, e.end());
// BEESLOG("noinsert_set.count(" << to_hex(p) << ") " << noinsert_set.count(p));
if (noinsert_set.count(p)) {
// Can't be both dedupe and zero
THROW_CHECK2(runtime_error, zero_set.count(p), dedupe_set.count(p), zero_set.count(p) + dedupe_set.count(p) < 2);
if (zero_set.count(p)) {
bytes_zeroed += next_p - p;
}
// BEESLOG("dedupe_set.count(" << to_hex(p) << ") " << dedupe_set.count(p));
if (dedupe_set.count(p)) {
if (p - last_p > 0) {
rewrite_file_range(BeesFileRange(bfr.fd(), last_p, p));
blocks_rewritten = true;
THROW_CHECK2(runtime_error, p, e.end(), p <= e.end());
copy_list.push_back(BeesFileRange(bfr.fd(), last_p, p));
}
last_p = next_p;
} else {
off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
bar.at(bar_p) = '+';
}
}
BEESTRACE("last");
if (next_p - last_p > 0) {
rewrite_file_range(BeesFileRange(bfr.fd(), last_p, next_p));
blocks_rewritten = true;
}
if (blocks_rewritten) {
// Nothing left to insert, all blocks clobbered
insert_map.clear();
} else {
// BEESLOG("No blocks rewritten");
BEESCOUNT(scan_no_rewrite);
if (next_p > last_p) {
THROW_CHECK2(runtime_error, next_p, e.end(), next_p <= e.end());
copy_list.push_back(BeesFileRange(bfr.fd(), last_p, next_p));
}
}
// We did not rewrite the extent and it contained data, so insert it.
for (auto i : insert_map) {
off_t bar_p = (i.first - e.begin()) / BLOCK_SIZE_SUMS;
BEESTRACE("e " << e << "bar_p = " << bar_p << " i.first-e.begin() " << i.first - e.begin() << " i.second " << i.second.first << ", " << i.second.second);
if (noinsert_set.count(i.first)) {
// FIXME: we removed one reference to this copy. Avoid thrashing?
hash_table->erase_hash_addr(i.second.first, i.second.second);
// Block was clobbered, do not insert
// Will look like 'Ddddd' because we skip deduped blocks
bar.at(bar_p) = 'D';
BEESCOUNT(inserted_clobbered);
// Don't copy an entire extent
if (!bytes_zeroed && copy_list.size() == 1 && copy_list.begin()->size() == e.size()) {
copy_list.clear();
}
// Count total copies
uint64_t bytes_copied = 0;
for (const auto &i : copy_list) {
bytes_copied += i.size();
}
BEESTRACE("bar: " << bar);
// Don't do nuisance dedupes part 1: free more blocks than we create
THROW_CHECK3(runtime_error, bytes_copied, bytes_zeroed, bytes_deduped, bytes_copied >= bytes_zeroed);
const auto cost_copy = bytes_copied - bytes_zeroed;
const auto gain_dedupe = bytes_deduped + bytes_zeroed;
if (cost_copy > gain_dedupe) {
BEESLOGDEBUG("Too many bytes copied (" << pretty(bytes_copied) << ") for bytes deduped (" << pretty(bytes_deduped) << ") and holes punched (" << pretty(bytes_zeroed) << "), skipping extent");
BEESCOUNT(scan_skip_bytes);
force_insert = true;
}
// Don't do nuisance dedupes part 2: nobody needs more than 100 dedupe/copy ops in one extent
if (dedupe_list.size() + copy_list.size() > 100) {
BEESLOGDEBUG("Too many dedupe (" << dedupe_list.size() << ") and copy (" << copy_list.size() << ") operations, skipping extent");
BEESCOUNT(scan_skip_ops);
force_insert = true;
}
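The two "nuisance dedupe" gates above reduce to a small cost/benefit calculation. A minimal sketch, with an illustrative `ExtentPlan` struct (these names are not the bees API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative summary of an extent's planned work.
struct ExtentPlan {
	uint64_t bytes_copied = 0;
	uint64_t bytes_zeroed = 0;
	uint64_t bytes_deduped = 0;
	size_t dedupe_ops = 0;
	size_t copy_ops = 0;
};

// Gate 1: a rewrite must free at least as many bytes (dedupe + punched holes)
// as it copies. Gate 2: cap the number of dedupe/copy operations per extent.
bool worth_rewriting(const ExtentPlan &p, size_t max_ops = 100)
{
	const uint64_t cost_copy = p.bytes_copied - p.bytes_zeroed; // zeroed bytes become holes, not copies
	const uint64_t gain = p.bytes_deduped + p.bytes_zeroed;
	if (cost_copy > gain)
		return false; // creates more new data than it frees
	if (p.dedupe_ops + p.copy_ops > max_ops)
		return false; // too many operations for one extent
	return true;
}
```

When either gate fails, the code above sets `force_insert` so the extent's hashes are still inserted, but no dedupe or copy work is performed.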
// Track whether we rewrote anything
bool extent_modified = false;
// If we didn't delete the dedupe list, do the dedupes now
for (const auto &i : dedupe_list) {
BEESNOTE("dedup " << i);
if (force_insert || m_ctx->dedup(i)) {
BEESCOUNT(replacedst_dedup_hit);
THROW_CHECK0(runtime_error, i.second);
for (off_t ip = i.second.begin(); ip < i.second.end(); ip += BLOCK_SIZE_SUMS) {
if (ip >= e.begin() && ip < e.end()) {
off_t bar_p = (ip - e.begin()) / BLOCK_SIZE_SUMS;
if (bar.at(bar_p) != '=') {
if (ip == i.second.begin()) {
bar.at(bar_p) = '<';
} else if (ip + BLOCK_SIZE_SUMS >= i.second.end()) {
bar.at(bar_p) = '>';
} else {
bar.at(bar_p) = 'd';
}
}
}
}
extent_modified = !force_insert;
} else {
BEESLOGINFO("dedup failed: " << i);
BEESCOUNT(replacedst_dedup_miss);
// User data changed while we were looking up the extent, or we have a bug.
// We can't fix this, but we can immediately stop wasting effort.
return;
}
}
// Then the copy/rewrites
for (const auto &i : copy_list) {
if (!force_insert) {
rewrite_file_range(i);
extent_modified = true;
}
for (auto p = i.begin(); p < i.end(); p += BLOCK_SIZE_SUMS) {
off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
// Leave zeros as-is because they aren't really copies
if (bar.at(bar_p) != '0') {
bar.at(bar_p) = '+';
}
}
}
if (!force_insert) {
// Push matched hashes to front
for (const auto &i : front_hash_list) {
hash_table->push_front_hash_addr(i.first, i.second);
BEESCOUNT(scan_push_front);
}
// Invalidate cached resolves
for (const auto &i : invalidate_addr_list) {
m_ctx->invalidate_addr(i);
}
}
// Don't insert hashes pointing to an extent we just deleted
if (!extent_modified) {
// We did not rewrite the extent and it contained data, so insert it.
// BEESLOGDEBUG("Inserting " << insert_map.size() << " hashes from " << bfr);
for (const auto &i : insert_map) {
hash_table->push_random_hash_addr(i.second.first, i.second.second);
off_t bar_p = (i.first - e.begin()) / BLOCK_SIZE_SUMS;
if (bar.at(bar_p) == 'i') {
bar.at(bar_p) = '.';
}
BEESCOUNT(scan_hash_insert);
}
}
// Visualize
if (bar != string(block_count, '.')) {
BEESLOGINFO("scan: " << pretty(e.size()) << " " << to_hex(e.begin()) << " [" << bar << "] " << to_hex(e.end()) << ' ' << name_fd(bfr.fd()));
BEESLOGINFO(
(force_insert ? "skip" : "scan") << ": "
<< pretty(e.size()) << " "
<< dedupe_list.size() << "d" << copy_list.size() << "c"
<< ((bytes_zeroed + BLOCK_SIZE_SUMS - 1) / BLOCK_SIZE_SUMS) << "p"
<< (extent_compressed ? "z " : " ")
<< one_timer << "s {"
<< to_hex(e.bytenr()) << "+" << to_hex(e.offset()) << "} "
<< to_hex(e.begin()) << " [" << bar << "] " << to_hex(e.end())
<< ' ' << name_fd(bfr.fd())
);
}
// Costs 10% on benchmarks
// Put this extent into the recently seen list if we didn't rewrite it,
// and remove it if we did.
lock_seen.lock();
if (extent_modified) {
s_seen.erase(tup);
BEESCOUNT(scan_seen_erase);
} else {
// BEESLOGDEBUG("Seen " << tup << " " << e);
s_seen.insert(tup);
BEESCOUNT(scan_seen_insert);
}
lock_seen.unlock();
// Now causes 75% loss of performance in benchmarks
// bees_unreadahead(bfr.fd(), bfr.begin(), bfr.size());
return bfr;
}
shared_ptr<Exclusion>
@@ -703,14 +913,14 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
// No FD? Well, that was quick.
if (!bfr.fd()) {
// BEESLOGINFO("No FD in " << root_path() << " for " << bfr);
BEESCOUNT(scan_no_fd);
BEESCOUNT(scanf_no_fd);
return false;
}
// Sanity check
if (bfr.begin() >= bfr.file_size()) {
BEESLOGWARN("past EOF: " << bfr);
BEESCOUNT(scan_eof);
BEESLOGDEBUG("past EOF: " << bfr);
BEESCOUNT(scanf_eof);
return false;
}
@@ -730,9 +940,11 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
// BEESLOGDEBUG("Deferring extent bytenr " << to_hex(extent_bytenr) << " from " << bfr);
BEESCOUNT(scanf_deferred_extent);
start_over = true;
return; // from closure
}
Timer one_extent_timer;
scan_one_extent(bfr, e);
// BEESLOGDEBUG("Scanned " << e << " " << bfr);
BEESCOUNTADD(scanf_extent_ms, one_extent_timer.age() * 1000);
BEESCOUNT(scanf_extent);
});
@@ -784,9 +996,10 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
Timer resolve_timer;
struct rusage usage_before;
{
BEESNOTE("waiting to resolve addr " << addr << " with LOGICAL_INO");
const auto lock = MultiLocker::get_lock("logical_ino");
auto lock = MultiLocker::get_lock("logical_ino");
// Get this thread's system CPU usage
DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_before));
@@ -800,13 +1013,13 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
} else {
BEESCOUNT(resolve_fail);
}
DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_after));
const auto resolve_timer_age = resolve_timer.age();
BEESCOUNTADD(resolve_ms, resolve_timer_age * 1000);
lock.reset();
bees_throttle(resolve_timer_age, "resolve_addr");
}
// Again!
struct rusage usage_after;
DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_after));
const double sys_usage_delta =
(usage_after.ru_stime.tv_sec + usage_after.ru_stime.tv_usec / 1000000.0) -
(usage_before.ru_stime.tv_sec + usage_before.ru_stime.tv_usec / 1000000.0);
@@ -925,7 +1138,8 @@ BeesContext::start()
return make_shared<BeesTempFile>(shared_from_this());
});
m_logical_ino_pool.generator([]() {
return make_shared<BtrfsIoctlLogicalInoArgs>(0);
const auto extent_ref_size = sizeof(uint64_t) * 3;
return make_shared<BtrfsIoctlLogicalInoArgs>(0, BEES_MAX_EXTENT_REF_COUNT * extent_ref_size + sizeof(btrfs_data_container));
});
m_tmpfile_pool.checkin([](const shared_ptr<BeesTempFile> &btf) {
catch_all([&](){

View File

@@ -356,6 +356,8 @@ BeesHashTable::prefetch_loop()
auto avg_rates = thisStats / m_ctx->total_timer().age();
graph_blob << "\t" << avg_rates << "\n";
graph_blob << m_ctx->get_progress();
BEESLOGINFO(graph_blob.str());
catch_all([&]() {
m_stats_file.write(graph_blob.str());
@@ -446,10 +448,38 @@ BeesHashTable::fetch_missing_extent_by_index(uint64_t extent_index)
// If we are in prefetch, give the kernel a hint about the next extent
if (m_prefetch_running) {
// XXX: don't call this if bees_readahead is implemented by pread()
bees_readahead(m_fd, dirty_extent_offset + dirty_extent_size, dirty_extent_size);
// Use the kernel readahead here, because it might work for this use case
readahead(m_fd, dirty_extent_offset + dirty_extent_size, dirty_extent_size);
}
});
Cell *cell = m_extent_ptr[extent_index ].p_buckets[0].p_cells;
Cell *cell_end = m_extent_ptr[extent_index + 1].p_buckets[0].p_cells;
size_t toxic_cleared_count = 0;
set<BeesHashTable::Cell> seen_it(cell, cell_end);
while (cell < cell_end) {
if (cell->e_addr & BeesAddress::c_toxic_mask) {
++toxic_cleared_count;
cell->e_addr &= ~BeesAddress::c_toxic_mask;
// Clearing the toxic bit might mean we now have a duplicate.
// This could be due to a race between two
// inserts, one finds the extent toxic while the
// other does not. That's arguably a bug elsewhere,
// but we should rewrite the whole extent lookup/insert
// loop, not spend time fixing code that will be
// thrown out later anyway.
// If there is a cell that is identical to this one
// except for the toxic bit, then we don't need this one.
if (seen_it.count(*cell)) {
cell->e_addr = 0;
cell->e_hash = 0;
}
}
++cell;
}
if (toxic_cleared_count) {
BEESLOGDEBUG("Cleared " << toxic_cleared_count << " hashes while fetching hash table extent " << extent_index);
}
}
void
@@ -767,7 +797,7 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
for (auto fp = madv_flags; fp->value; ++fp) {
BEESTOOLONG("madvise(" << fp->name << ")");
if (madvise(m_byte_ptr, m_size, fp->value)) {
BEESLOGWARN("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
BEESLOGNOTICE("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
}
}
@@ -781,8 +811,19 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
prefetch_loop();
});
// Blacklist might fail if the hash table is not stored on a btrfs
// Blacklist might fail if the hash table is not stored on a btrfs,
// or if it's on a _different_ btrfs
catch_all([&]() {
// Root is definitely a btrfs
BtrfsIoctlFsInfoArgs root_info;
root_info.do_ioctl(m_ctx->root_fd());
// Hash might not be a btrfs
BtrfsIoctlFsInfoArgs hash_info;
// If btrfs fs_info ioctl fails, it must be a different fs
if (!hash_info.do_ioctl_nothrow(m_fd)) return;
// If Hash is a btrfs, Root must be the same one
if (root_info.fsid() != hash_info.fsid()) return;
// Hash is on the same one, blacklist it
m_ctx->blacklist_insert(BeesFileId(m_fd));
});
}

View File

@@ -384,7 +384,7 @@ BeesResolver::for_each_extent_ref(BeesBlockData bbd, function<bool(const BeesFil
return stop_now;
}
BeesFileRange
BeesRangePair
BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
{
BEESTRACE("replace_dst dst_bfr " << dst_bfr_in);
@@ -400,6 +400,7 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
BEESTRACE("overlap_bfr " << overlap_bfr);
BeesBlockData bbd(dst_bfr);
BeesRangePair rv = { BeesFileRange(), BeesFileRange() };
for_each_extent_ref(bbd, [&](const BeesFileRange &src_bfr_in) -> bool {
// Open src
@@ -436,21 +437,12 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
BEESCOUNT(replacedst_grown);
}
// Dedup
BEESNOTE("dedup " << brp);
if (m_ctx->dedup(brp)) {
BEESCOUNT(replacedst_dedup_hit);
m_found_dup = true;
overlap_bfr = brp.second;
// FIXME: find best range first, then dedupe that
return true; // i.e. break
} else {
BEESCOUNT(replacedst_dedup_miss);
return false; // i.e. continue
}
rv = brp;
m_found_dup = true;
return true;
});
// BEESLOG("overlap_bfr after " << overlap_bfr);
return overlap_bfr.copy_closed();
return rv;
}
BeesFileRange

File diff suppressed because it is too large

View File

@@ -8,38 +8,32 @@ thread_local BeesTracer *BeesTracer::tl_next_tracer = nullptr;
thread_local bool BeesTracer::tl_first = true;
thread_local bool BeesTracer::tl_silent = false;
#if __cplusplus >= 201703
static
bool
exception_check()
{
	return uncaught_exceptions();
}
#else
static
bool
exception_check()
{
	return uncaught_exception();
}
#endif
BeesTracer::~BeesTracer()
{
if (!tl_silent && exception_check()) {
if (tl_first) {
BEESLOGNOTICE("--- BEGIN TRACE --- exception ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE --- exception ---");
tl_first = false;
}
try {
m_func();
} catch (exception &e) {
BEESLOGNOTICE("Nested exception: " << e.what());
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception: " << e.what());
} catch (...) {
BEESLOGNOTICE("Nested exception ...");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception ...");
}
if (!m_next_tracer) {
BEESLOGNOTICE("--- END TRACE --- exception ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE --- exception ---");
}
}
tl_next_tracer = m_next_tracer;
@@ -49,7 +43,7 @@ BeesTracer::~BeesTracer()
}
}
BeesTracer::BeesTracer(function<void()> f, bool silent) :
BeesTracer::BeesTracer(const function<void()> &f, bool silent) :
m_func(f)
{
m_next_tracer = tl_next_tracer;
@@ -61,12 +55,12 @@ void
BeesTracer::trace_now()
{
BeesTracer *tp = tl_next_tracer;
BEESLOGNOTICE("--- BEGIN TRACE ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE ---");
while (tp) {
tp->m_func();
tp = tp->m_next_tracer;
}
BEESLOGNOTICE("--- END TRACE ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE ---");
}
bool
@@ -91,9 +85,9 @@ BeesNote::~BeesNote()
tl_next = m_prev;
unique_lock<mutex> lock(s_mutex);
if (tl_next) {
s_status[crucible::gettid()] = tl_next;
s_status[gettid()] = tl_next;
} else {
s_status.erase(crucible::gettid());
s_status.erase(gettid());
}
}
@@ -104,7 +98,7 @@ BeesNote::BeesNote(function<void(ostream &os)> f) :
m_prev = tl_next;
tl_next = this;
unique_lock<mutex> lock(s_mutex);
s_status[crucible::gettid()] = tl_next;
s_status[gettid()] = tl_next;
}
void

View File

@@ -183,6 +183,24 @@ BeesFileRange::grow_begin(off_t delta)
return m_begin;
}
off_t
BeesFileRange::shrink_begin(off_t delta)
{
THROW_CHECK1(invalid_argument, delta, delta > 0);
THROW_CHECK3(invalid_argument, delta, m_begin, m_end, delta + m_begin < m_end);
m_begin += delta;
return m_begin;
}
off_t
BeesFileRange::shrink_end(off_t delta)
{
THROW_CHECK1(invalid_argument, delta, delta > 0);
THROW_CHECK2(invalid_argument, delta, m_end, m_end >= delta);
m_end -= delta;
return m_end;
}
BeesFileRange::BeesFileRange(const BeesBlockData &bbd) :
m_fd(bbd.fd()),
m_begin(bbd.begin()),
@@ -349,8 +367,8 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESTRACE("e_second " << e_second);
// Preread entire extent
bees_readahead(second.fd(), e_second.begin(), e_second.size());
bees_readahead(first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());
bees_readahead_pair(second.fd(), e_second.begin(), e_second.size(),
first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());
auto hash_table = ctx->hash_table();
@@ -388,17 +406,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
break;
}
// Source extent cannot be toxic
BeesAddress first_addr(first.fd(), new_first.begin());
if (!first_addr.is_magic()) {
auto first_resolved = ctx->resolve_addr(first_addr);
if (first_resolved.is_toxic()) {
BEESLOGWARN("WORKAROUND: not growing matching pair backward because src addr is toxic:\n" << *this);
BEESCOUNT(pairbackward_toxic_addr);
break;
}
}
// Extend second range. If we hit BOF we can go no further.
BeesFileRange new_second = second;
BEESTRACE("new_second = " << new_second);
@@ -434,6 +441,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
// Source block cannot be zero in a non-compressed non-magic extent
BeesAddress first_addr(first.fd(), new_first.begin());
if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
BEESCOUNT(pairbackward_zero);
break;
@@ -449,7 +457,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
}
if (found_toxic) {
BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
BEESCOUNT(pairbackward_toxic_hash);
break;
}
@@ -491,17 +499,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
break;
}
// Source extent cannot be toxic
BeesAddress first_addr(first.fd(), new_first.begin());
if (!first_addr.is_magic()) {
auto first_resolved = ctx->resolve_addr(first_addr);
if (first_resolved.is_toxic()) {
BEESLOGWARN("WORKAROUND: not growing matching pair forward because src is toxic:\n" << *this);
BEESCOUNT(pairforward_toxic);
break;
}
}
// Extend second range. If we hit EOF we can go no further.
BeesFileRange new_second = second;
BEESTRACE("new_second = " << new_second);
@@ -545,6 +542,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
// Source block cannot be zero in a non-compressed non-magic extent
BeesAddress first_addr(first.fd(), new_first.begin());
if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
BEESCOUNT(pairforward_zero);
break;
@@ -560,7 +558,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
}
if (found_toxic) {
BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
BEESCOUNT(pairforward_toxic_hash);
break;
}
@@ -574,7 +572,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
if (first.overlaps(second)) {
BEESLOGTRACE("after grow, first " << first << "\n\toverlaps " << second);
BEESLOGDEBUG("after grow, first " << first << "\n\toverlaps " << second);
BEESCOUNT(bug_grow_pair_overlaps);
}
@@ -589,6 +587,22 @@ BeesRangePair::copy_closed() const
return BeesRangePair(first.copy_closed(), second.copy_closed());
}
void
BeesRangePair::shrink_begin(off_t const delta)
{
first.shrink_begin(delta);
second.shrink_begin(delta);
THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
}
void
BeesRangePair::shrink_end(off_t const delta)
{
first.shrink_end(delta);
second.shrink_end(delta);
THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
}
ostream &
operator<<(ostream &os, const BeesAddress &ba)
{
@@ -660,7 +674,7 @@ BeesAddress::magic_check(uint64_t flags)
static const unsigned recognized_flags = compressed_flags | delalloc_flags | ignore_flags | unusable_flags;
if (flags & ~recognized_flags) {
BEESLOGTRACE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
BEESLOGNOTICE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
m_addr = UNUSABLE;
// maybe we throw here?
BEESCOUNT(addr_unrecognized);

View File

@@ -12,9 +12,10 @@ Load management options:
-C, --thread-factor Worker thread factor (default 1)
-G, --thread-min Minimum worker thread count (default 0)
-g, --loadavg-target Target load average for worker threads (default none)
--throttle-factor Idle time between operations (default 1.0)
Filesystem tree traversal options:
-m, --scan-mode Scanning mode (0..2, default 0)
-m, --scan-mode Scanning mode (0..4, default 4)
Workarounds:
-a, --workaround-btrfs-send Workaround for btrfs send

View File

@@ -4,6 +4,7 @@
#include "crucible/process.h"
#include "crucible/string.h"
#include "crucible/task.h"
#include "crucible/uname.h"
#include <cctype>
#include <cmath>
@@ -11,17 +12,19 @@
#include <iostream>
#include <memory>
#include <regex>
#include <sstream>
// PRIx64
#include <inttypes.h>
#include <sched.h>
#include <sys/fanotify.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
// statfs
#include <linux/magic.h>
#include <sys/statfs.h>
// setrlimit
#include <sys/time.h>
#include <sys/resource.h>
@@ -198,7 +201,7 @@ BeesTooLong::check() const
if (age() > m_limit) {
ostringstream oss;
m_func(oss);
BEESLOGWARN("PERFORMANCE: " << *this << " sec: " << oss.str());
BEESLOGINFO("PERFORMANCE: " << *this << " sec: " << oss.str());
}
}
@@ -214,21 +217,41 @@ BeesTooLong::operator=(const func_type &f)
return *this;
}
void
bees_readahead(int const fd, const off_t offset, const size_t size)
static
bool
bees_readahead_check(int const fd, off_t const offset, size_t const size)
{
// FIXME: the rest of the code calls this function more often than necessary,
// usually back-to-back calls on the same range in a loop.
// Simply discard requests that are identical to recent requests.
const Stat stat_rv(fd);
auto tup = make_tuple(offset, size, stat_rv.st_dev, stat_rv.st_ino);
static mutex s_recent_mutex;
static set<decltype(tup)> s_recent;
unique_lock<mutex> lock(s_recent_mutex);
if (s_recent.size() > BEES_MAX_EXTENT_REF_COUNT) {
s_recent.clear();
BEESCOUNT(readahead_clear);
}
const auto rv = s_recent.insert(tup);
// If we recently did this readahead, we're done here
if (!rv.second) {
BEESCOUNT(readahead_skip);
}
return rv.second;
}
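The filter above deduplicates readahead requests by remembering recent `(offset, size, dev, ino)` tuples and clearing the whole set when it grows too large, instead of maintaining LRU order. A minimal sketch; the capacity constant and function name are illustrative (bees reuses `BEES_MAX_EXTENT_REF_COUNT` as the limit):

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <sys/types.h>
#include <tuple>

static const size_t MAX_RECENT_REQUESTS = 9999; // illustrative capacity

// Return true if this exact request has not been seen recently
// (and record it); return false for an exact repeat.
bool should_readahead(off_t offset, size_t size, dev_t dev, ino_t ino)
{
	static std::mutex s_mutex;
	static std::set<std::tuple<off_t, size_t, dev_t, ino_t>> s_recent;
	const auto tup = std::make_tuple(offset, size, dev, ino);
	std::unique_lock<std::mutex> lock(s_mutex);
	if (s_recent.size() > MAX_RECENT_REQUESTS)
		s_recent.clear(); // cheap reset instead of LRU eviction
	return s_recent.insert(tup).second;
}
```

Clearing wholesale occasionally lets a repeat through, which is harmless here: the worst case is one redundant readahead.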
static
void
bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
{
if (!bees_readahead_check(fd, offset, size)) return;
Timer readahead_timer;
BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
#if 0
// In the kernel, readahead() is identical to posix_fadvise(..., POSIX_FADV_DONTNEED)
DIE_IF_NON_ZERO(readahead(fd, offset, size));
#else
// Make sure this data is in page cache by brute force
// This isn't necessary and it might even be slower,
// but the btrfs kernel code does readahead with lower ioprio
// and might discard the readahead request entirely,
// so it's maybe, *maybe*, worth doing both.
// The btrfs kernel code does readahead with lower ioprio
// and might discard the readahead request entirely.
BEESNOTE("emulating readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
auto working_size = size;
auto working_offset = offset;
@@ -239,16 +262,41 @@ bees_readahead(int const fd, const off_t offset, const size_t size)
// Ignore errors and short reads. It turns out our size
// parameter isn't all that accurate, so we can't use
// the pread_or_die template.
(void)!pread(fd, dummy, this_read_size, working_offset);
BEESCOUNT(readahead_count);
BEESCOUNTADD(readahead_bytes, this_read_size);
const auto pr_rv = pread(fd, dummy, this_read_size, working_offset);
if (pr_rv >= 0) {
BEESCOUNT(readahead_count);
BEESCOUNTADD(readahead_bytes, pr_rv);
} else {
BEESCOUNT(readahead_fail);
}
working_offset += this_read_size;
working_size -= this_read_size;
}
#endif
BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
}
static mutex s_only_one;
void
bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offset2, size_t size2)
{
if (!bees_readahead_check(fd, offset, size) && !bees_readahead_check(fd2, offset2, size2)) return;
BEESNOTE("waiting to readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size) << ","
<< "\n\t" << name_fd(fd2) << " offset " << to_hex(offset2) << " len " << pretty(size2));
unique_lock<mutex> m_lock(s_only_one);
bees_readahead_nolock(fd, offset, size);
bees_readahead_nolock(fd2, offset2, size2);
}
void
bees_readahead(int const fd, const off_t offset, const size_t size)
{
if (!bees_readahead_check(fd, offset, size)) return;
BEESNOTE("waiting to readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
unique_lock<mutex> m_lock(s_only_one);
bees_readahead_nolock(fd, offset, size);
}
void
bees_unreadahead(int const fd, off_t offset, size_t size)
{
@@ -259,6 +307,48 @@ bees_unreadahead(int const fd, off_t offset, size_t size)
BEESCOUNTADD(readahead_unread_ms, unreadahead_timer.age() * 1000);
}
static double bees_throttle_factor = 0.0;
void
bees_throttle(const double time_used, const char *const context)
{
static mutex s_mutex;
unique_lock<mutex> throttle_lock(s_mutex);
struct time_pair {
double time_used = 0;
double time_count = 0;
double longest_sleep_time = 0;
};
static map<string, time_pair> s_time_map;
auto &this_time = s_time_map[context];
auto &this_time_used = this_time.time_used;
auto &this_time_count = this_time.time_count;
auto &longest_sleep_time = this_time.longest_sleep_time;
this_time_used += time_used;
++this_time_count;
// Keep the timing data fresh
static Timer s_fresh_timer;
if (s_fresh_timer.age() > 60) {
s_fresh_timer.reset();
this_time_count *= 0.9;
this_time_used *= 0.9;
}
// Wait for enough data to calculate rates
if (this_time_used < 1.0 || this_time_count < 1.0) return;
const auto avg_time = this_time_used / this_time_count;
const auto sleep_time = min(60.0, bees_throttle_factor * avg_time - time_used);
if (sleep_time <= 0) {
return;
}
if (sleep_time > longest_sleep_time) {
BEESLOGDEBUG(context << ": throttle delay " << sleep_time << " s, time used " << time_used << " s, avg time " << avg_time << " s");
longest_sleep_time = sleep_time;
}
throttle_lock.unlock();
BEESNOTE(context << ": throttle delay " << sleep_time << " s, time used " << time_used << " s, avg time " << avg_time << " s");
nanosleep(sleep_time);
}
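The delay computation in `bees_throttle` reduces to one expression: with throttle factor F and running-average op duration avg, sleep `F * avg - time_used`, so each operation's total wall time approaches F times the average; F = 0 (the default) disables throttling, and the sleep is capped at 60 seconds. A minimal sketch of just that arithmetic:

```cpp
#include <algorithm>
#include <cassert>

// Compute the throttle sleep for one operation. Returns 0 when no
// sleep is needed (slow ops already "paid" their share of wall time).
double throttle_delay(double time_used, double avg_time, double factor, double max_sleep = 60.0)
{
	const double sleep_time = std::min(max_sleep, factor * avg_time - time_used);
	return sleep_time > 0 ? sleep_time : 0.0;
}
```

The real function additionally keeps per-context averages under a mutex and decays them by 10% per minute so the rates track recent load.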
thread_local random_device bees_random_device;
thread_local uniform_int_distribution<default_random_engine::result_type> bees_random_seed_dist(
numeric_limits<default_random_engine::result_type>::min(),
@@ -304,6 +394,73 @@ BeesStringFile::read()
return read_string(fd, st.st_size);
}
static
void
bees_fsync(int const fd)
{
// Note that when btrfs renames a temporary over an existing file,
// it flushes the temporary, so we get the right behavior if we
// just do nothing here (except when the file is first created;
// however, in that case the result is the same as if the file
// did not exist, was empty, or was filled with garbage).
//
// Kernel versions prior to 5.16 had bugs which would put ghost
// dirents in $BEESHOME if there was a crash when we called
// fsync() here.
//
// Some other filesystems will throw our data away if we don't
// call fsync, so we do need to call fsync() on those filesystems.
//
// Newer btrfs kernel versions rely on fsync() to report
// unrecoverable write errors. If we don't check the fsync()
// result, we'll lose the data when we rename(). Kernel 6.2 added
// a number of new root causes for the class of "unrecoverable
// write errors" so we need to check this now.
BEESNOTE("checking filesystem type for " << name_fd(fd));
// LSB deprecated statfs without providing a replacement that
// can fill in the f_type field.
struct statfs stf = { 0 };
DIE_IF_NON_ZERO(fstatfs(fd, &stf));
if (stf.f_type != BTRFS_SUPER_MAGIC) {
BEESLOGONCE("Using fsync on non-btrfs filesystem type " << to_hex(stf.f_type));
BEESNOTE("fsync non-btrfs " << name_fd(fd));
DIE_IF_NON_ZERO(fsync(fd));
return;
}
static bool did_uname = false;
static bool do_fsync = false;
if (!did_uname) {
Uname uname;
const string version(uname.release);
static const regex version_re(R"/(^(\d+)\.(\d+)\.)/", regex::optimize | regex::ECMAScript);
smatch m;
// Last known bug in the fsync-rename use case was fixed in kernel 5.16
static const auto min_major = 5, min_minor = 16;
if (regex_search(version, m, version_re)) {
const auto major = stoul(m[1]);
const auto minor = stoul(m[2]);
if (tie(major, minor) > tie(min_major, min_minor)) {
BEESLOGONCE("Using fsync on btrfs because kernel version is " << major << "." << minor);
do_fsync = true;
} else {
BEESLOGONCE("Not using fsync on btrfs because kernel version is " << major << "." << minor);
}
} else {
BEESLOGONCE("Not using fsync on btrfs because can't parse kernel version '" << version << "'");
}
did_uname = true;
}
if (do_fsync) {
BEESNOTE("fsync btrfs " << name_fd(fd));
DIE_IF_NON_ZERO(fsync(fd));
}
}
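The kernel-version gate in `bees_fsync` above can be isolated as a small predicate: parse "major.minor" out of the uname release string and compare against the first release with the fsync-rename fix. A sketch under the same regex and comparison as above; the function name is illustrative:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <tuple>

// Return true if the kernel release is newer than min_major.min_minor
// (strictly, matching the '>' comparison above); false if the string
// can't be parsed, which the caller treats as "don't fsync on btrfs".
bool kernel_fsync_safe(const std::string &release,
                       unsigned long min_major = 5, unsigned long min_minor = 16)
{
	static const std::regex version_re(R"(^(\d+)\.(\d+)\.)",
		std::regex::optimize | std::regex::ECMAScript);
	std::smatch m;
	if (!std::regex_search(release, m, version_re))
		return false;
	const auto major = std::stoul(m[1].str());
	const auto minor = std::stoul(m[2].str());
	// Lexicographic (major, minor) comparison
	return std::make_tuple(major, minor) > std::make_tuple(min_major, min_minor);
}
```

Note the strict `>` means a release of exactly 5.16 is excluded, even though the comment above says the last known bug was fixed in 5.16.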
void
BeesStringFile::write(string contents)
{
@@ -319,19 +476,8 @@ BeesStringFile::write(string contents)
Fd ofd = openat_or_die(m_dir_fd, tmpname, FLAGS_CREATE_FILE, S_IRUSR | S_IWUSR);
BEESNOTE("writing " << tmpname << " in " << name_fd(m_dir_fd));
write_or_die(ofd, contents);
#if 0
// This triggers too many btrfs bugs. I wish I was kidding.
// Forget snapshots, balance, compression, and dedupe:
// the system call you have to fear on btrfs is fsync().
// Also note that when bees renames a temporary over an
// existing file, it flushes the temporary, so we get
// the right behavior if we just do nothing here
// (except when the file is first created; however,
// in that case the result is the same as if the file
// did not exist, was empty, or was filled with garbage).
BEESNOTE("fsyncing " << tmpname << " in " << name_fd(m_dir_fd));
DIE_IF_NON_ZERO(fsync(ofd));
#endif
bees_fsync(ofd);
}
BEESNOTE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
BEESTRACE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
@@ -355,6 +501,8 @@ BeesTempFile::resize(off_t offset)
// Count time spent here
BEESCOUNTADD(tmp_resize_ms, resize_timer.age() * 1000);
bees_throttle(resize_timer.age(), "tmpfile_resize");
}
void
@@ -490,6 +638,8 @@ BeesTempFile::make_copy(const BeesFileRange &src)
}
BEESCOUNTADD(tmp_copy_ms, copy_timer.age() * 1000);
bees_throttle(copy_timer.age(), "tmpfile_copy");
BEESCOUNT(tmp_copy);
return rv;
}
@@ -528,19 +678,23 @@ operator<<(ostream &os, const siginfo_t &si)
static sigset_t new_sigset, old_sigset;
static
void
block_term_signal()
block_signals()
{
BEESLOGDEBUG("Masking signals");
DIE_IF_NON_ZERO(sigemptyset(&new_sigset));
DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGTERM));
DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGINT));
DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGUSR1));
DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGUSR2));
DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &new_sigset, &old_sigset));
}
static
void
wait_for_term_signal()
wait_for_signals()
{
BEESNOTE("waiting for signals");
BEESLOGDEBUG("Waiting for signals...");
@@ -557,14 +711,28 @@ wait_for_term_signal()
THROW_ERRNO("sigwaitinfo errno = " << errno);
} else {
BEESLOGNOTICE("Received signal " << rv << " info " << info);
// Unblock so we die immediately if signalled again
DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &old_sigset, &new_sigset));
break;
// If SIGTERM or SIGINT, unblock so we die immediately if signalled again
switch (info.si_signo) {
case SIGUSR1:
BEESLOGNOTICE("Received SIGUSR1 - pausing workers");
TaskMaster::pause(true);
break;
case SIGUSR2:
BEESLOGNOTICE("Received SIGUSR2 - unpausing workers");
TaskMaster::pause(false);
break;
case SIGTERM:
case SIGINT:
default:
DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &old_sigset, &new_sigset));
BEESLOGDEBUG("Signal catcher exiting");
return;
}
}
}
BEESLOGDEBUG("Signal catcher exiting");
}
static
int
bees_main(int argc, char *argv[])
{
@@ -573,7 +741,7 @@ bees_main(int argc, char *argv[])
BEESLOGDEBUG("exception (ignored): " << s);
BEESCOUNT(exception_caught_silent);
} else {
BEESLOGNOTICE("\n\n*** EXCEPTION ***\n\t" << s << "\n***\n");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: EXCEPTION: " << s);
BEESCOUNT(exception_caught);
}
});
@@ -588,47 +756,51 @@ bees_main(int argc, char *argv[])
// Have to block signals now before we create a bunch of threads
// so the threads will also have the signals blocked.
block_term_signal();
block_signals();
// Create a context so we can apply configuration to it
shared_ptr<BeesContext> bc = make_shared<BeesContext>();
BEESLOGDEBUG("context constructed");
string cwd(readlink_or_die("/proc/self/cwd"));
// Defaults
bool use_relative_paths = false;
bool chatter_prefix_timestamp = true;
double thread_factor = 0;
unsigned thread_count = 0;
unsigned thread_min = 0;
double load_target = 0;
bool workaround_btrfs_send = false;
BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_INDEPENDENT;
BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_EXTENT;
// Configure getopt_long
// Options with no short form
enum {
BEES_OPT_THROTTLE_FACTOR = 256,
};
static const struct option long_options[] = {
{ "thread-factor", required_argument, NULL, 'C' },
{ "thread-min", required_argument, NULL, 'G' },
{ "strip-paths", no_argument, NULL, 'P' },
{ "no-timestamps", no_argument, NULL, 'T' },
{ "workaround-btrfs-send", no_argument, NULL, 'a' },
{ "thread-count", required_argument, NULL, 'c' },
{ "loadavg-target", required_argument, NULL, 'g' },
{ "help", no_argument, NULL, 'h' },
{ "scan-mode", required_argument, NULL, 'm' },
{ "absolute-paths", no_argument, NULL, 'p' },
{ "timestamps", no_argument, NULL, 't' },
{ "verbose", required_argument, NULL, 'v' },
{ 0, 0, 0, 0 },
{ .name = "thread-factor", .has_arg = required_argument, .val = 'C' },
{ .name = "throttle-factor", .has_arg = required_argument, .val = BEES_OPT_THROTTLE_FACTOR },
{ .name = "thread-min", .has_arg = required_argument, .val = 'G' },
{ .name = "strip-paths", .has_arg = no_argument, .val = 'P' },
{ .name = "no-timestamps", .has_arg = no_argument, .val = 'T' },
{ .name = "workaround-btrfs-send", .has_arg = no_argument, .val = 'a' },
{ .name = "thread-count", .has_arg = required_argument, .val = 'c' },
{ .name = "loadavg-target", .has_arg = required_argument, .val = 'g' },
{ .name = "help", .has_arg = no_argument, .val = 'h' },
{ .name = "scan-mode", .has_arg = required_argument, .val = 'm' },
{ .name = "absolute-paths", .has_arg = no_argument, .val = 'p' },
{ .name = "timestamps", .has_arg = no_argument, .val = 't' },
{ .name = "verbose", .has_arg = required_argument, .val = 'v' },
{ 0 },
};
// Build getopt_long's short option list from the long_options table.
// While we're at it, make sure we didn't duplicate any options.
string getopt_list;
set<decltype(option::val)> option_vals;
map<decltype(option::val), string> option_vals;
for (const struct option *op = long_options; op->val; ++op) {
THROW_CHECK1(runtime_error, op->val, !option_vals.count(op->val));
option_vals.insert(op->val);
const auto ins_rv = option_vals.insert(make_pair(op->val, op->name));
THROW_CHECK1(runtime_error, op->val, ins_rv.second);
if ((op->val & 0xff) != op->val) {
continue;
}
@@ -639,27 +811,31 @@ bees_main(int argc, char *argv[])
}
// Parse options
int c;
while (true) {
int option_index = 0;
c = getopt_long(argc, argv, getopt_list.c_str(), long_options, &option_index);
const auto c = getopt_long(argc, argv, getopt_list.c_str(), long_options, &option_index);
if (-1 == c) {
break;
}
BEESLOGDEBUG("Parsing option '" << static_cast<char>(c) << "'");
// getopt_long should have weeded out any invalid options,
// so we can go ahead and throw here
BEESLOGDEBUG("Parsing option '" << option_vals.at(c) << "'");
switch (c) {
case 'C':
thread_factor = stod(optarg);
break;
case BEES_OPT_THROTTLE_FACTOR:
bees_throttle_factor = stod(optarg);
break;
case 'G':
thread_min = stoul(optarg);
break;
case 'P':
crucible::set_relative_path(cwd);
use_relative_paths = true;
break;
case 'T':
chatter_prefix_timestamp = false;
@@ -677,7 +853,7 @@ bees_main(int argc, char *argv[])
root_scan_mode = static_cast<BeesRoots::ScanMode>(stoul(optarg));
break;
case 'p':
crucible::set_relative_path("");
use_relative_paths = false;
break;
case 't':
chatter_prefix_timestamp = true;
@@ -695,12 +871,12 @@ bees_main(int argc, char *argv[])
case 'h':
default:
do_cmd_help(argv);
return EXIT_FAILURE;
return EXIT_SUCCESS;
}
}
if (optind + 1 != argc) {
BEESLOGERR("Only one filesystem path per bees process");
BEESLOGERR("Exactly one filesystem path required");
return EXIT_FAILURE;
}
@@ -740,22 +916,32 @@ bees_main(int argc, char *argv[])
BEESLOGNOTICE("setting worker thread pool maximum size to " << thread_count);
TaskMaster::set_thread_count(thread_count);
BEESLOGNOTICE("setting throttle factor to " << bees_throttle_factor);
// Set root path
string root_path = argv[optind++];
BEESLOGNOTICE("setting root path to '" << root_path << "'");
bc->set_root_path(root_path);
// Set path prefix
if (use_relative_paths) {
crucible::set_relative_path(name_fd(bc->root_fd()));
}
// Workaround for btrfs send
bc->roots()->set_workaround_btrfs_send(workaround_btrfs_send);
// Set root scan mode
bc->roots()->set_scan_mode(root_scan_mode);
// Workaround for the logical-ino-vs-clone kernel bug
MultiLocker::enable_locking(true);
// Start crawlers
bc->start();
// Now we just wait forever
wait_for_term_signal();
wait_for_signals();
// Shut it down
bc->stop();


@@ -78,13 +78,13 @@ const int BEES_PROGRESS_INTERVAL = BEES_STATS_INTERVAL;
const int BEES_STATUS_INTERVAL = 1;
// Number of file FDs to cache when not in active use
const size_t BEES_FILE_FD_CACHE_SIZE = 4096;
const size_t BEES_FILE_FD_CACHE_SIZE = 524288;
// Number of root FDs to cache when not in active use
const size_t BEES_ROOT_FD_CACHE_SIZE = 1024;
const size_t BEES_ROOT_FD_CACHE_SIZE = 65536;
// Number of FDs to open (rlimit)
const size_t BEES_OPEN_FILE_LIMIT = (BEES_FILE_FD_CACHE_SIZE + BEES_ROOT_FD_CACHE_SIZE) * 2 + 100;
const size_t BEES_OPEN_FILE_LIMIT = BEES_FILE_FD_CACHE_SIZE + BEES_ROOT_FD_CACHE_SIZE + 100;
// Worker thread factor (multiplied by detected number of CPU cores)
const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
@@ -93,10 +93,11 @@ const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
const double BEES_TOO_LONG = 5.0;
// Avoid any extent where LOGICAL_INO takes this much kernel CPU time
const double BEES_TOXIC_SYS_DURATION = 0.1;
const double BEES_TOXIC_SYS_DURATION = 5.0;
// Maximum number of refs to a single extent
const size_t BEES_MAX_EXTENT_REF_COUNT = (16 * 1024 * 1024 / 24) - 1;
// Maximum number of refs to a single extent before we have other problems
// If we have more than 10K refs to an extent, adding another will save 0.01% space
const size_t BEES_MAX_EXTENT_REF_COUNT = 9999; // (16 * 1024 * 1024 / 24);
// How long between hash table histograms
const double BEES_HASH_TABLE_ANALYZE_INTERVAL = BEES_STATS_INTERVAL;
@@ -121,9 +122,9 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
// macros ----------------------------------------
#define BEESLOG(lv,x) do { if (lv < bees_log_level) { Chatter __chatter(lv, BeesNote::get_name()); __chatter << x; } } while (0)
#define BEESLOGTRACE(x) do { BEESLOG(LOG_DEBUG, x); BeesTracer::trace_now(); } while (0)
#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(LOG_ERR, x); })
#define BEES_TRACE_LEVEL LOG_DEBUG
#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(BEES_TRACE_LEVEL, "TRACE: " << x << " at " << __FILE__ << ":" << __LINE__); })
#define BEESTOOLONG(x) BeesTooLong SRSLY_WTF_C(beesTooLong_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
#define BEESNOTE(x) BeesNote SRSLY_WTF_C(beesNote_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
@@ -133,6 +134,14 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
#define BEESLOGINFO(x) BEESLOG(LOG_INFO, x)
#define BEESLOGDEBUG(x) BEESLOG(LOG_DEBUG, x)
#define BEESLOGONCE(__x) do { \
static bool already_logged = false; \
if (!already_logged) { \
already_logged = true; \
BEESLOGNOTICE(__x); \
} \
} while (false)
#define BEESCOUNT(stat) do { \
BeesStats::s_global.add_count(#stat); \
} while (0)
@@ -184,7 +193,7 @@ class BeesTracer {
thread_local static bool tl_silent;
thread_local static bool tl_first;
public:
BeesTracer(function<void()> f, bool silent = false);
BeesTracer(const function<void()> &f, bool silent = false);
~BeesTracer();
static void trace_now();
static bool get_silent();
@@ -299,6 +308,11 @@ public:
off_t grow_begin(off_t delta);
/// @}
/// @{ Make range smaller
off_t shrink_end(off_t delta);
off_t shrink_begin(off_t delta);
/// @}
friend ostream & operator<<(ostream &os, const BeesFileRange &bfr);
};
@@ -515,7 +529,7 @@ class BeesCrawl {
bool fetch_extents();
void fetch_extents_harder();
bool next_transid();
bool restart_crawl_unlocked();
BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;
public:
@@ -527,6 +541,9 @@ public:
BeesCrawlState get_state_end() const;
void set_state(const BeesCrawlState &bcs);
void deferred(bool def_setting);
bool deferred() const;
bool finished() const;
bool restart_crawl();
};
class BeesScanMode;
@@ -535,7 +552,8 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
shared_ptr<BeesContext> m_ctx;
BeesStringFile m_crawl_state_file;
map<uint64_t, shared_ptr<BeesCrawl>> m_root_crawl_map;
using CrawlMap = map<uint64_t, shared_ptr<BeesCrawl>>;
CrawlMap m_root_crawl_map;
mutex m_mutex;
uint64_t m_crawl_dirty = 0;
uint64_t m_crawl_clean = 0;
@@ -554,17 +572,13 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
condition_variable m_stop_condvar;
bool m_stop_requested = false;
void insert_new_crawl();
void insert_root(const BeesCrawlState &bcs);
CrawlMap insert_new_crawl();
Fd open_root_nocache(uint64_t root);
Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
uint64_t transid_min();
uint64_t transid_max();
uint64_t transid_max_nocache();
void state_load();
ostream &state_to_stream(ostream &os);
void state_save();
bool crawl_roots();
string crawl_state_filename() const;
void crawl_state_set_dirty();
void crawl_state_erase(const BeesCrawlState &bcs);
@@ -572,13 +586,16 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
void writeback_thread();
uint64_t next_root(uint64_t root = 0);
void current_state_set(const BeesCrawlState &bcs);
RateEstimator& transid_re();
bool crawl_batch(shared_ptr<BeesCrawl> crawl);
void clear_caches();
shared_ptr<BeesCrawl> insert_root(const BeesCrawlState &bcs);
bool up_to_date(const BeesCrawlState &bcs);
friend class BeesCrawl;
friend class BeesFdCache;
friend class BeesScanMode;
friend class BeesScanModeSubvol;
friend class BeesScanModeExtent;
public:
BeesRoots(shared_ptr<BeesContext> ctx);
@@ -594,17 +611,22 @@ public:
Fd open_root_ino(const BeesFileId &bfi) { return open_root_ino(bfi.root(), bfi.ino()); }
bool is_root_ro(uint64_t root);
// TODO: do extent-tree scans instead
enum ScanMode {
SCAN_MODE_LOCKSTEP,
SCAN_MODE_INDEPENDENT,
SCAN_MODE_SEQUENTIAL,
SCAN_MODE_RECENT,
SCAN_MODE_EXTENT,
SCAN_MODE_COUNT, // must be last
};
void set_scan_mode(ScanMode new_mode);
void set_workaround_btrfs_send(bool do_avoid);
uint64_t transid_min();
uint64_t transid_max();
void wait_for_transid(const uint64_t count);
};
struct BeesHash {
@@ -664,6 +686,8 @@ class BeesRangePair : public pair<BeesFileRange, BeesFileRange> {
public:
BeesRangePair(const BeesFileRange &src, const BeesFileRange &dst);
bool grow(shared_ptr<BeesContext> ctx, bool constrained);
void shrink_begin(const off_t delta);
void shrink_end(const off_t delta);
BeesRangePair copy_closed() const;
bool operator<(const BeesRangePair &that) const;
friend ostream & operator<<(ostream &os, const BeesRangePair &brp);
@@ -737,11 +761,14 @@ class BeesContext : public enable_shared_from_this<BeesContext> {
shared_ptr<BeesThread> m_progress_thread;
shared_ptr<BeesThread> m_status_thread;
mutex m_progress_mtx;
string m_progress_str;
void set_root_fd(Fd fd);
BeesResolveAddrResult resolve_addr_uncached(BeesAddress addr);
BeesFileRange scan_one_extent(const BeesFileRange &bfr, const Extent &e);
void scan_one_extent(const BeesFileRange &bfr, const Extent &e);
void rewrite_file_range(const BeesFileRange &bfr);
public:
@@ -772,6 +799,8 @@ public:
void dump_status();
void show_progress();
void set_progress(const string &str);
string get_progress();
void start();
void stop();
@@ -834,7 +863,7 @@ public:
BeesFileRange find_one_match(BeesHash hash);
void replace_src(const BeesFileRange &src_bfr);
BeesFileRange replace_dst(const BeesFileRange &dst_bfr);
BeesRangePair replace_dst(const BeesFileRange &dst_bfr);
bool found_addr() const { return m_found_addr; }
bool found_data() const { return m_found_data; }
@@ -868,7 +897,10 @@ extern const char *BEES_VERSION;
extern thread_local default_random_engine bees_generator;
string pretty(double d);
void bees_readahead(int fd, off_t offset, size_t size);
void bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offset2, size_t size2);
void bees_unreadahead(int fd, off_t offset, size_t size);
void bees_throttle(double time_used, const char *context);
string format_time(time_t t);
bool exception_check();
#endif


@@ -8,6 +8,7 @@ PROGRAMS = \
process \
progress \
seeker \
table \
task \
all: test


@@ -19,7 +19,9 @@ seeker_finder(const vector<uint64_t> &vec, uint64_t lower, uint64_t upper)
if (ub != s.end()) ++ub;
if (ub != s.end()) ++ub;
for (; ub != s.end(); ++ub) {
if (*ub > upper) break;
if (*ub > upper) {
break;
}
}
return set<uint64_t>(lb, ub);
}
@@ -28,7 +30,7 @@ static bool test_fails = false;
static
void
seeker_test(const vector<uint64_t> &vec, uint64_t const target)
seeker_test(const vector<uint64_t> &vec, uint64_t const target, bool const always_out = false)
{
cerr << "Find " << target << " in {";
for (auto i : vec) {
@@ -36,11 +38,13 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
}
cerr << " } = ";
size_t loops = 0;
tl_seeker_debug_str = make_shared<ostringstream>();
bool local_test_fails = false;
bool excepted = catch_all([&]() {
auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
const auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
++loops;
return seeker_finder(vec, lower, upper);
});
}, uint64_t(32));
cerr << found;
uint64_t my_found = 0;
for (auto i : vec) {
@@ -52,13 +56,15 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
cerr << " (correct)";
} else {
cerr << " (INCORRECT - right answer is " << my_found << ")";
test_fails = true;
local_test_fails = true;
}
});
cerr << " (" << loops << " loops)" << endl;
if (excepted) {
test_fails = true;
if (excepted || local_test_fails || always_out) {
cerr << dynamic_pointer_cast<ostringstream>(tl_seeker_debug_str)->str();
}
test_fails = test_fails || local_test_fails;
tl_seeker_debug_str.reset();
}
static
@@ -89,6 +95,39 @@ test_seeker()
seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max());
seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max() - 1);
seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() - 1 }, numeric_limits<uint64_t>::max());
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 0);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 1);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 2);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 3);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 4);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 5);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 6);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 7);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 8);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 9);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 1 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 2 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 3 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 4 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 5 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 6 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 7 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 8 );
// Pulled from a bees debug log
seeker_test(vector<uint64_t> {
6821962845,
6821962848,
6821963411,
6821963422,
6821963536,
6821963539,
6821963835, // <- appeared during the search, causing an exception
6821963841,
6822575316,
}, 6821971036, true);
}

test/table.cc (new file)

@@ -0,0 +1,63 @@
#include "tests.h"
#include "crucible/table.h"
using namespace crucible;
using namespace std;
void
print_table(const Table::Table& t)
{
cerr << "BEGIN TABLE\n";
cerr << t;
cerr << "END TABLE\n";
cerr << endl;
}
void
test_table()
{
Table::Table t;
t.insert_row(Table::endpos, vector<Table::Content> {
Table::Text("Hello, World!"),
Table::Text("2"),
Table::Text("3"),
Table::Text("4"),
});
print_table(t);
t.insert_row(Table::endpos, vector<Table::Content> {
Table::Text("Greeting"),
Table::Text("two"),
Table::Text("three"),
Table::Text("four"),
});
print_table(t);
t.insert_row(Table::endpos, vector<Table::Content> {
Table::Fill('-'),
Table::Text("ii"),
Table::Text("iii"),
Table::Text("iv"),
});
print_table(t);
t.mid(" | ");
t.left("| ");
t.right(" |");
print_table(t);
t.insert_col(1, vector<Table::Content> {
Table::Text("1"),
Table::Text("one"),
Table::Text("i"),
Table::Text("I"),
});
print_table(t);
t.at(2, 1) = Table::Text("Two\nLines");
print_table(t);
}
int
main(int, char**)
{
RUN_A_TEST(test_table());
exit(EXIT_SUCCESS);
}