mirror of https://github.com/Zygo/bees.git synced 2025-08-01 13:23:28 +02:00

56 Commits

Author SHA1 Message Date
Zygo Blaxell
3a17a4dcdd tempfile: make sure FS_COMPR_FL stays set
btrfs will set the FS_NOCOMP_FL flag when all of the following are true:

1.  The filesystem is not mounted with the `compress-force` option
2.  Heuristic analysis of the data suggests the data is compressible
3.  Compression fails to produce a result that is smaller than the original

If the compression ratio is 40%, and the original data is 128K long,
then compressed data will be about 52K long (rounded up to 4K), so item
3 is usually false; however, if the original data is 8K long, then the
compressed data will be 8K long too, and btrfs will set FS_NOCOMP_FL.

To work around that, keep setting FS_COMPR_FL and clearing FS_NOCOMP_FL
every time a TempFile is reset.
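The per-reset fixup can be sketched with the standard `FS_IOC_GETFLAGS`/`FS_IOC_SETFLAGS` interface from `<linux/fs.h>`; the helper names here are hypothetical, not bees' actual code:

```cpp
#include <cassert>
#include <linux/fs.h>   // FS_COMPR_FL, FS_NOCOMP_FL, FS_IOC_GETFLAGS, FS_IOC_SETFLAGS
#include <sys/ioctl.h>

// Pure helper: ask for compression again, and drop the FS_NOCOMP_FL bit
// that btrfs may have set behind our back.  Other bits are left alone.
inline unsigned long reset_compress_flags(unsigned long flags) {
    return (flags | FS_COMPR_FL) & ~static_cast<unsigned long>(FS_NOCOMP_FL);
}

// Hypothetical per-reset fixup for a TempFile fd.
inline int tempfile_reassert_compression(int fd) {
    unsigned long flags = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) return -1;
    unsigned long wanted = reset_compress_flags(flags);
    if (wanted == flags) return 0;   // nothing to do
    return ioctl(fd, FS_IOC_SETFLAGS, &wanted);
}
```

Calling something like this on every TempFile reset keeps FS_COMPR_FL set even after btrfs has given up on an incompressible extent.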

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-29 23:25:36 -04:00
Zygo Blaxell
4039ef229e tempfile: clear FS_NOCOW_FL while setting FS_COMPR_FL
FS_NOCOW_FL can be inherited from the subvol root directory, and it
conflicts with FS_COMPR_FL.

We can only dedupe when FS_NOCOW_FL is the same on src and dst, which
means we can only dedupe when FS_NOCOW_FL is clear, so we should clear
FS_NOCOW_FL on the temporary files we create for dedupe.
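As a rough sketch (the helper names are made up for illustration; only the flag constants come from `<linux/fs.h>`), the dedupe precondition and the temporary-file flag fixup look like:

```cpp
#include <cassert>
#include <linux/fs.h>  // FS_COMPR_FL, FS_NOCOMP_FL, FS_NOCOW_FL

// Dedupe requires the FS_NOCOW_FL state to match on src and dst, and bees'
// temporary files are COW, so in practice the bit must be clear on both.
inline bool dedupe_flags_compatible(unsigned long src_flags, unsigned long dst_flags) {
    return (src_flags & FS_NOCOW_FL) == 0 && (dst_flags & FS_NOCOW_FL) == 0;
}

// Flags for a freshly created temporary file: request compression, and drop
// any FS_NOCOW_FL inherited from the subvol root (it conflicts with FS_COMPR_FL).
inline unsigned long tempfile_flags(unsigned long inherited) {
    return (inherited | FS_COMPR_FL)
         & ~static_cast<unsigned long>(FS_NOCOW_FL | FS_NOCOMP_FL);
}
```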

Fixes: https://github.com/Zygo/bees/issues/314
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-29 23:24:55 -04:00
Zygo Blaxell
e9d4aa4586 roots: make the "idle" label useful
Apply the "idle" label only when the crawl is finished _and_ its
transid_max is up to date.  This makes the keyword "idle" better reflect
that bees is not only finished crawling, but also finished scanning the
crawled extents in the queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 23:06:14 -04:00
Zygo Blaxell
504f4cda80 progress: move the "idle" cell to the next cycle ETA column
When all extents within a size tier have been queued, and all the
extents belong to the same file, the queue might take a long time to
fully process.  Also, any progress that is made will be obscured by
the "idle" tag in the "point" column.

Move "idle" to the next cycle ETA column, since the ETA duration will
be zero, and no useful information is lost since we would have "-"
there anyway.

Since the "point" column can now display the maximum value, lower
that maximum to 999999 so that we don't use an extra column.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 22:33:05 -04:00
Zygo Blaxell
6c36f4973f extent scan: log the bfr when removing a prealloc extent
With subvol scan, the crawl task name is the subvol/inode pair
corresponding to the file offset in the log message.  The identity of
the file can be determined by looking up the subvol/inode pair in the
log message.

With extent scan, the crawl task name is the extent bytenr corresponding
to the file offset in the log message.  This extent is deleted when the
log message is emitted, so a later lookup on the extent bytenr will not
find any references to the extent, and the identity of the file cannot
be determined.

Log the bfr, which does a /proc lookup on the name of the fd, so the
filename is logged.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 22:33:05 -04:00
Zygo Blaxell
b1bd99c077 seeker: harden against changes in the data during binary search
During the search, the region between `upper_bound` and `target_pos`
should contain no data items.  The search lowers `upper_bound` and raises
`lower_bound` until they both point to the last item before `target_pos`.

The `lower_bound` is increased to the position of the last item returned
by a search (`high_pos`) when that item is lower than `target_pos`.
This avoids some loop iterations compared to a strict binary search
algorithm, which would increase `lower_bound` only as far as `probe_pos`.

When the search runs over live extent items, occasionally a new extent
will appear between `upper_bound` and `target_pos`.  When this happens,
`lower_bound` is bumped up to the position of one of the new items, but
that position is in the "unoccupied" space between `upper_bound` and
`target_pos`, where no items are supposed to exist, so `seek_backward`
throws an exception.

To cut down on the noise, only increase `lower_bound` as far as
`upper_bound`.  This avoids the exception without increasing the number
of loop iterations for normal cases.

In the exceptional cases, extra loop iterations are needed to skip over
the new items.  This raises the worst-case number of loop iterations
by one.
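The hardened bound update amounts to clamping; a sketch, where the variable names mirror the commit message rather than bees' actual `seek_backward` implementation:

```cpp
#include <algorithm>
#include <cassert>

// When a search over live data returns an item (high_pos) that appeared in
// the supposedly-empty region between upper_bound and target_pos, raising
// lower_bound all the way to high_pos would violate the invariant
// lower_bound <= upper_bound.  Clamp the raise at upper_bound instead.
inline long raise_lower_bound(long lower_bound, long high_pos, long upper_bound) {
    return std::max(lower_bound, std::min(high_pos, upper_bound));
}
```

In the exceptional case the clamp costs at most one extra loop iteration to skip over the new items.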

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
d5e805ab8d seeker: add a real-world test case
This seek_backward failed in bees because an extent appeared during
the search:

	fetch(probe_pos = 6821971036, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821971004 = probe_pos - have_delta 32 (want_delta 32)
	fetch(probe_pos = 6821971004, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970972 = probe_pos - have_delta 32 (want_delta 32)
	fetch(probe_pos = 6821970972, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970908 = probe_pos - have_delta 64 (want_delta 64)
	fetch(probe_pos = 6821970908, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970780 = probe_pos - have_delta 128 (want_delta 128)
	fetch(probe_pos = 6821970780, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970524 = probe_pos - have_delta 256 (want_delta 256)
	fetch(probe_pos = 6821970524, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970012 = probe_pos - have_delta 512 (want_delta 512)
	fetch(probe_pos = 6821970012, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821968988 = probe_pos - have_delta 1024 (want_delta 1024)
	fetch(probe_pos = 6821968988, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821966940 = probe_pos - have_delta 2048 (want_delta 2048)
	fetch(probe_pos = 6821966940, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821962844 = probe_pos - have_delta 4096 (want_delta 4096)
	fetch(probe_pos = 6821962844, target_pos = 6821971036)
	 = 6821962845..6821962848
	found_low = true, lower_bound = 6821962845
	lower_bound = high_pos 6821962848
	loop: lower_bound 6821962848, probe_pos 6821966942, upper_bound 6821971036
	fetch(probe_pos = 6821966942, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821966942
	loop: lower_bound 6821962848, probe_pos 6821964895, upper_bound 6821966942
	fetch(probe_pos = 6821964895, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821964895
	loop: lower_bound 6821962848, probe_pos 6821963871, upper_bound 6821964895
	fetch(probe_pos = 6821963871, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821963871
	loop: lower_bound 6821962848, probe_pos 6821963359, upper_bound 6821963871
	fetch(probe_pos = 6821963359, target_pos = 6821971036)
	 = 6821963411..6821963422
	lower_bound = high_pos 6821963422
	loop: lower_bound 6821963422, probe_pos 6821963646, upper_bound 6821963871
	fetch(probe_pos = 6821963646, target_pos = 6821971036)
	 = 6822575316..6822575316

Here, we found nothing between 6821963646 and 6822575316, so upper_bound is reduced
to 6821963646...

	upper_bound = probe_pos 6821963646
	loop: lower_bound 6821963422, probe_pos 6821963534, upper_bound 6821963646
	fetch(probe_pos = 6821963534, target_pos = 6821971036)
	 = 6821963536..6821963539
	lower_bound = high_pos 6821963539
	loop: lower_bound 6821963539, probe_pos 6821963592, upper_bound 6821963646
	fetch(probe_pos = 6821963592, target_pos = 6821971036)
	 = 6821963835..6821963841

...but here, we found 6821963835 and 6821963841, which are between
6821963646 and 6822575316.  They were not there before, so the binary
search result is now invalid because new extent items were added while
it was running.  This results in an exception:

	lower_bound = high_pos 6821963841
	--- BEGIN TRACE --- exception ---
	objectid = 27942759813120, adjusted to 27942793363456 at bees-roots.cc:1103
	Crawling extent BeesCrawlState 250:0 offset 0x0 transid 1311734..1311735 at bees-roots.cc:991
	get_state_end at bees-roots.cc:988
	find_next_extent 250 at bees-roots.cc:929
	---  END  TRACE --- exception ---
	*** EXCEPTION ***
	exception type std::out_of_range: lower_bound = 6821963841, upper_bound = 6821963646 failed constraint check (lower_bound <= upper_bound) at ../include/crucible/seeker.h:139

The exception prevents seek_backward from returning a value, which in
turn prevents consumers of that value from acting on a nonsense result.

Copy the details of this search into a test case.  Note that the test
case won't reproduce the exception because the simulation of fetch()
is not changing the results part way through.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
337bbffac1 extent scan: drop a nonsense trace message
This message appears only during exception backtraces, but it doesn't
carry any useful information.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
527396e5cb extent scan: integrate seeker debug output stream
Send both tree_search ioctl and `seek_backward` debug logs to the
same output stream, but only write that stream to the debug log if
there is an exception.

The feature remains disabled at compile time.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
bc7c35aa2d extent scan: only write a detailed debug log when there's an exception
Note that when enabled, the logs are still very CPU-intensive,
but most of the logs will be discarded.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
0953160584 trace: export exception_check
We need to call this from more than one place in bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
80f9c147f7 btrfs-tree: clean up the fetch function's return set
Commit d32f31f411 ("btrfs-tree: harden
`rlower_bound` against exceptional objects") passes up to `seek_backward`
the first btrfs item in the result set that is above upper_bound.
This is somewhat wasteful, as `seek_backward` cannot use such a result.

Reverse that change in behavior, while keeping the rest of that commit.

This introduces a new case: the search ioctl produces only items that
are above the upper bound, so there are no usable items in the result
set, and the loop would continue until the end of the filesystem is
reached.  Handle that by setting an explicit exit variable.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
50e012ad6d seeker: add a runtime debug stream
This allows detailed but selective debugging when using the library,
particularly when something goes wrong.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
9a9644659c trace: clean up the formatting around top-level exception log messages
Fewer newlines.  More consistent application of the "TRACE:" prefix.
All at the same log level.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
fd53bff959 extent scan: drop out-of-date comment
The comment describes an earlier version which submitted each extent
ref as a separate Task, but now all extent refs are handled by the same
Task to minimize the amount of time between processing the first and
last reference to an extent.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
9439dad93a extent scan: extra check to make sure no Tasks are started when throttled
Previously `scan()` would run the extent scan loop once, and enqueue one
extent, before checking for throttling.  Do an extra check before that,
and bail out so that zero extents are enqueued when throttled.
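The shape of the fix, sketched with stand-in types (`should_throttle` and the queue are illustrative, not bees' interfaces):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Check for throttling at the top of the loop body, before anything is
// enqueued, so a throttled scan() enqueues zero extents instead of one.
inline int scan_extents(const std::function<bool()> &should_throttle,
                        const std::vector<int> &extents,
                        std::vector<int> &queue) {
    int enqueued = 0;
    for (int extent : extents) {
        if (should_throttle()) break;  // extra check: bail before enqueuing
        queue.push_back(extent);
        ++enqueued;
    }
    return enqueued;
}
```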

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
ef9b4b3a50 extent scan: shorten task name for extent map
Linux kernel thread names are hardcoded at 16 characters.  Every character
counts, and "0x" wastes two.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
7ca857dff0 docs: add the ghost subvols bug to the bugs list
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
8331f70db7 progress: fix ETA calculations
The "tm_left" field was the estimated _total_ duration of the crawl,
not the amount of time remaining.  The ETA timestamp was then calculated
based on the estimated time to run the crawl if it started _now_, not
at the start timestamp.

Fix the duration and ETA calculations.
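A sketch of the corrected arithmetic (field and function names are illustrative):

```cpp
#include <cassert>

// Estimate the crawl's total duration from progress so far, then derive the
// time remaining and an ETA anchored at the crawl's *start* timestamp,
// rather than treating the whole estimated duration as "time left" or
// computing the ETA as if the crawl started now.
struct Eta {
    double total_duration;  // estimated seconds for the whole crawl
    double time_left;       // estimated seconds remaining
    double eta_timestamp;   // absolute time at which the crawl should finish
};

inline Eta compute_eta(double bytes_done, double bytes_total,
                       double start_time, double now) {
    const double elapsed = now - start_time;
    const double total = elapsed * bytes_total / bytes_done;  // caller ensures bytes_done > 0
    return Eta{ total, total - elapsed, start_time + total };
}
```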

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Steven Allen
a844024395 Make the runtime directory private
The status file contains sensitive information like filenames and duplicate chunk ranges. It might also make sense to set the process-wide `UMask=`, but that may have other unintended side effects.
2025-03-26 15:02:42 +00:00
Zygo Blaxell
47243aef14 hash: handle $BEESHOME on btrfs too
The `_nothrow` variants of `do_ioctl` return true when they succeed,
which is the opposite of what `ioctl` does.

Fix the logic so bees can correctly identify its own hash table when
it's on the same filesystem as the target.
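A minimal sketch of the convention that bit this commit (bees' real `do_ioctl_nothrow` is more general; this only shows the inverted sense relative to raw `ioctl(2)`):

```cpp
#include <cassert>
#include <sys/ioctl.h>

// ioctl(2) returns 0 on success and -1 on error; the _nothrow wrappers
// return true on success.  Logic written for one convention silently
// inverts when applied to the other.
inline bool do_ioctl_nothrow_sketch(int fd, unsigned long request, void *arg) {
    return ioctl(fd, request, arg) == 0;
}
```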

Fixes: f6908420ad ("hash: handle $BEESHOME on non-btrfs")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-17 21:18:08 -05:00
Zygo Blaxell
a670aa5a71 extent scan: don't divide by zero if there were no loops
Commit 183b6a5361 ("extent scan: refactor
BeesCrawl, BeesScanMode*") moved some statistics calculations out of
the loop in `find_next_extent`, but did not ensure that the statistics
would not be calculated if the loop had not executed any iterations.

In rare instances, the function returns without entering the loop at all,
which results in divide by zero.  Add a check just before doing that.
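The fix reduces to an early return before the statistics math; a sketch with illustrative names:

```cpp
#include <cassert>

// Skip the per-loop statistics entirely when the loop ran zero times,
// instead of dividing the accumulated cost by a zero iteration count.
inline double average_search_cost(double total_cost, unsigned long loop_count) {
    if (loop_count == 0) return 0.0;   // the guard added by the fix
    return total_cost / loop_count;
}
```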

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
51b3bcdbe4 trace: deprecate BEESLOGTRACE, align trace logs with exception notices
Exceptions were logged at level NOTICE while the stack traces were logged
at level DEBUG.  That produced useless noise in the output with `-v5`
or `-v6`, where there were exception headings logged, but no details.

Fix that by placing the exceptions and traces at level DEBUG, but prefix
them with `TRACE:` for easy grepping.

Most of the events associated with BEESLOGTRACE either never happen,
or they are harmless (e.g. trying to open deleted files or subvols).
Reassign them to ordinary BEESLOGDEBUG, with one exception for
unrecognized Extent flags that should be debugged if any appear.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
ae58401d53 trace: avoid one copy in every trace function
While investigating https://github.com/Zygo/bees/issues/282 I noticed that
we're doing at least one unnecessary extra copy of the functor in BEESTRACE.
Get rid of it with a const reference.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
3e7eb43b51 BeesStringFile: figure out when to call--or _not_ call--fsync
Older kernel versions featured some bugs in btrfs `fsync`, which could
leave behind "ghost dirents", orphan filename items that did not have
a corresponding inode.  These dirents were created during log replay
during the first mount after a crash due to several different bugs in
the log tree and its use over the years.  The last known bug of this
kind was fixed in kernel 5.16.  As of this writing, no fixes for this
bug have been backported to any earlier LTS kernel.

Some filesystems, including btrfs, will flush the contents of a new
file before renaming it over an old file.  On paper, btrfs can do this
very cheaply since the contents of the new file are not referenced, and
the old file is not dereferenced, until a tree commit which includes both
actions atomically; however, in real life, btrfs provides `fsync`-like
semantics and uses the log-tree infrastructure to implement them, which
compromises performance and acts as a magnet for bugs.

The benefit of this trade-off is that `rename` can be used as a
synchronization point for data outside of the btrfs, which would not
happen if everything `rename` does was simply deferred to the next
tree commit.  The cost of this trade-off is that for the first 8 years
of its existence, bees would trigger the bug so often that the project
recommended its users put $BEESHOME in its own subvol to make it easy
to remove ghost dirents left behind by the bug.

Some other filesystems, such as xfs, don't have any special semantics for
`rename`, and require `fsync` to avoid garbage or missing data after
a crash.  Even filesystems which do have a special case for `rename`
can be configured to turn it off.

btrfs will silently delete data from files in the event that an
unrecoverable data block write error occurs.  Kernel version 6.2 adds
important new and unexpected cases where this can happen on filesystems
using raid56 data, but it also happens in all usable btrfs versions
(the silent deletion behavior was introduced in kernel version 3.9).

Unrecoverable write errors are currently reported to userspace only
through `fsync`.  Since the failed extents are deleted, they cannot be
detected via csum failures or scrub after the fact--and it's too late
by then, the data is already gone.  `fsync` is the last opportunity
to detect the write failure before the `rename`.  If the error is not
detected, the contents of the file will be silently discarded in btrfs.
The impact on bees is that scans will abruptly restart from zero after
a crash combined with some other reasonably common failures.

Putting all of this together leads to a rather complex workaround:
if the filesystem under $BEESHOME (specifically, the filesystem where
BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs
filesystem, and the host kernel is a version prior to 5.16, then don't
call `fsync` before `rename`.  In all other cases, do call `fsync`,
and prevent dependent writes (i.e. the following `rename`) in the event
of errors.

Since present kernel versions still require `fsync`, we don't need
an upper bound on the kernel version check until someone fixes btrfs
`rename` (or perhaps adds a flag to `renameat2` which prevents use of
the log tree) in the kernel.  Once that fix happens, we can drop the
`fsync` call for kernels after that fixed version.
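A minimal POSIX sketch of the write-fsync-rename pattern described above, with the `fsync` made conditional; the helper name and error handling are illustrative, not bees' `BeesStringFile` code:

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Write a temporary file, fsync it (unless the btrfs-before-5.16 workaround
// applies), and only then rename it over the old file.  A write error
// surfaced by fsync blocks the dependent rename.
inline bool write_file_atomically(const std::string &path,
                                  const std::string &data, bool do_fsync) {
    const std::string tmp = path + ".tmp";
    int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = write(fd, data.data(), data.size()) ==
              static_cast<ssize_t>(data.size());
    if (ok && do_fsync && fsync(fd) != 0) ok = false;  // detect write failure here
    if (close(fd) != 0) ok = false;
    if (!ok) { unlink(tmp.c_str()); return false; }    // failed: don't rename
    return rename(tmp.c_str(), path.c_str()) == 0;
}
```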

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:04:20 -05:00
Zygo Blaxell
962d94567c hexdump: fix pointer cast const mismatch
Another hit from the exotic compiler collection:  build fails on GCC 9,
from Ubuntu 20...but not later versions of GCC.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:00:31 -05:00
Zygo Blaxell
6dbef5f27b fs: improve compatibility with linux-libc-dev 5.4
Fix the missing symbols that popped up when adding chunk tree to
lib/fs.cc.  Also define the missing symbols instead of merely trying to
avoid them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-08 21:17:15 -05:00
Zygo Blaxell
88b1e4ca6e main: unconditionally enable workaround for the logical_ino-vs-clone kernel bug
This obviously doesn't fix or prevent the kernel bug, but it does prevent
bees from triggering the bug without assistance from another application.

The bug can still be triggered by running bees at the same time as an
application which uses clone or LOGICAL_INO.  `btdu` uses LOGICAL_INO,
while `cp` from coreutils (and many others) use clone (reflink copy).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
c1d7fa13a5 roots: drop unnecessary mutex unlock in stop_request
In commit 31b2aa3c0d ("context: speed
up orderly process termination"), the stop request was split into two
methods after the mutex unlock.

Now that there's nothing after the mutex unlock in `stop_request`,
there's no need for an explicit unlock to do what the destructor would
have done anyway.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
aa39bddb2d extent scan: implement an experimental ordered scan mode
Parallel scan runs each extent size tier in a separate thread.  The
threads compete to process extents within the tier's size range.

Ordered scan processes each extent size tier completely before moving on
to the next.  In theory, this means large extents always get processed
quickly, especially when new ones appear, and the queue does not fill up
with small extents.

In practice, the multi-threaded scanner massively outperforms the
single-threaded scanner, unless the number of worker threads is very
small (i.e. one).

Disable most of the feature for now, but leave the code in place so it
can be easily reactivated for future testing.

Ordered scan introduces a parallelized extent mapper Task.  Keep that in
parallel scan mode, which further enhances the parallelism.  The extent
scan crawl threads now run at 'idle' priority while the map tasks run
at normal priority, so the map tasks don't flood the task queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
1aea2d2f96 crawl: deprecate use of BeesCrawl to search the extent tree
BeesScanModeExtent can do that by itself now.  Overloading the subvol
crawl code resulted in an ugly, inefficient hack, and we definitely
don't want to accidentally continue to use it.

Remove the support for reading the extent tree and add some `assert`s
to make sure it isn't still used somewhere.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
673b450671 docs: update event counters after extent scan refactoring and crawl skipping
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
183b6a5361 extent scan: refactor BeesCrawl, BeesScanMode*
The main gains here are:

* Move extent tree searches into BeesScanModeExtent so that they are
not slowed down by the BeesCrawl code, which was designed for the
much more specialized metadata in subvol trees.
* Enable short extent skipping now that BeesCrawl is out of the way.
* Stop enumerating btrfs subvols when in extent scan mode.

All this gets rid of >99% of unnecessary extent tree searches.
Incremental extent scan cycles now finish in milliseconds instead
of minutes.

BeesCrawl was never designed to cope with the structure and content of
the extent tree.  It would waste thousands of tree-search ioctl calls
reading and ignoring metadata items.

Performance was particularly bad when a binary search was involved, as any
binary search probe that landed in a metadata block group would read and
discard all the metadata items in the block group, sequentially, repeated
for each level of the binary search.  This was blocking implementation of
short extent skipping optimization for large extent size tiers, because
the skips were using thousands of tree searches to skip over only a few
hundred extent items.

Extent scan also had to read every extent item twice to do the
transid filtering, because BeesCrawl's interface discarded the relevant
information when it converted a `BtrfsTreeItem` into a `BeesFileRange`.
The cost of this extra fetch was negligible, but it could have been zero.

Fix this by:

* Copy the equivalent of `fetch_extents` from BeesCrawl into
`BeesScanModeExtent`, then give each of the extent scan crawlers its
own `BtrfsDataExtentTreeFetcher` instance.  This enables extent tree
searches to avoid pure (non-mixed) metadata block groups.  `BeesCrawl`
is now used only for its interface to `BeesRoots` for saving state in
`beescrawl.dat`, and never to determine the next extent tree item.

* Move subvol-specific parts of `BeesRoots` into a new class
`BeesScanModeSubvol` so that `BeesScanModeExtent` doesn't have to enable
or support them.  In particular, `bees -m4` no longer enumerates all
of the _subvol_ crawlers.  `BeesRoots` is still used to save and load
crawl state.

* Move several members from `BeesScanModeExtent` into a per-crawler
state object `SizeTier` to eliminate the need for some locks and to
maintain separate cache state for `BtrfsDataExtentTreeFetcher`.

* Reuse the `BtrfsTreeItem` to get the generation field for the transid
range filter.

* Avoid a few corner cases when handling errors, where extent scan might
drop an extent without scanning it, or fail to advance to the next extent.

* Enable the extent-skipping algorithm for large size tiers, now that
`BeesCrawl::fetch_extents` is no longer slowing it down.

* Add a debug stream interface which developers can easily turn on when
needed to inspect the decisions that extent scan is making.

* Track metrics that are more useful, particularly searches per extent
scanned, and fraction of extents that are skipped.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
b6446d7316 roots: rework open_root_nocache to use btrfs-tree
This gets rid of one open-coded btrfs tree search.

Also reduce the log noise level for subvol open failures, and remove
some ancient references to `BEESLOG`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
d32f31f411 btrfs-tree: harden rlower_bound against exceptional objects
Rearrange the logic in `rlower_bound` so it can cope with a tree
that contains mostly block-aligned objects, with a few exceptions
filtered out by `hdr_stop`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
dd08f6379f btrfs-tree: add a method to get root backref items to BtrfsRootFetcher
This complements the already existing support for reading the fields of
a root backref.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
58ee297cde btrfs-tree: connect methods to the debug stream interface
In some cases functions already had existing debug stream support
which can be redirected to the new interface.  In other cases, new
debug messages are added.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
a3c0ba0d69 fs: add a runtime debug stream for btrfs tree searches
This allows plugging in an ostream at run time so that we can audit all
the search calls we are doing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
75040789c6 btrfs-tree: drop BtrfsFsTreeFetcher and clean up class comments
BtrfsFsTreeFetcher was used for early versions of the extent scanner, but
neither subvol nor extent scan now needs an object that is both persistent
and configured to access only one subvol.  BtrfsExtentDataFetcher does
the same thing in that case.

Clarify the comments on what the remaining classes do, so that
BtrfsFsTreeFetcher doesn't get inadvertently reinvented in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f9a697518d btrfs-tree: introduce BtrfsDataExtentTreeFetcher to read data extents without metadata
Binary searches can be extremely slow if the target bytenr is near a
metadata block group, because metadata items are not visible to the
binary search algorithm.  In a non-mixed-bg filesystem, there can be
hundreds of thousands of metadata items between data extent items, and
since the binary search algorithm can't see them, it will run searches
that iterate over hundreds of thousands of objects about a dozen times.

This is less of a problem for mixed-bg filesystems because the data and
metadata blocks are not isolated from each other.  The binary search
algorithm still can't see the metadata items, but there are usually
some data items close by to prevent the linear item filter from running
too long.

Introduce a new fetcher class (all the good names were taken) that tracks
where the end of the current block group is.  When the end of the current
block group is reached in the linear search, skip ahead to a block group
that can contain data items.
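A toy model of the skip (the map-based layout is illustrative; the real fetcher reads chunk items to find block-group boundaries):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Block groups modeled as start -> (length, is_data).  When the linear item
// filter walks past the end of the current block group, jump straight to the
// next group that can contain data items, instead of iterating item by item
// through a metadata-only group.
using BlockGroups = std::map<uint64_t, std::pair<uint64_t, bool>>;

inline uint64_t next_data_pos(const BlockGroups &groups, uint64_t pos) {
    for (const auto &g : groups) {
        const uint64_t start = g.first;
        const uint64_t end = start + g.second.first;
        const bool is_data = g.second.second;
        if (is_data && pos < end) return pos > start ? pos : start;
    }
    return UINT64_MAX;  // past the last data block group
}
```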

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
c4ba6ec269 fs: add a ntoa function for chunk types
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
440740201a main: the base directory for --strip-paths should be root_fd, not cwd
The cwd is where core dumps and various profiling and verification
libraries want to write their data, whereas root_fd is the root of the
target filesystem.  These are often intentionally different.  When
they are different, `--strip-paths` sets the wrong prefix to strip
from paths.

Once the root fd has been established, we can set the path prefix to
the string prefix that we'll get from future calls to `name_fd`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f6908420ad hash: handle $BEESHOME on non-btrfs
bees explicitly supports storing $BEESHOME on another filesystem, and
does not require that filesystem to be btrfs; however, if $BEESHOME
is on a non-btrfs filesystem, there is an exception on every startup
when trying to identify the subvol root of the hash table file in order
to blacklist it, because non-btrfs filesystems don't have subvol roots.

Fix by checking not only whether $BEESHOME is on btrfs, but whether it
is on the _same_ btrfs as the bees root, without throwing an exception.
The hash table is blacklisted only when both filesystems are btrfs and
have the same fsid.
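A generic analogue of that check using `statfs(2)`'s `f_fsid` (bees actually uses the btrfs FS_INFO ioctl, and also verifies that both filesystems are btrfs before comparing):

```cpp
#include <cassert>
#include <cstring>
#include <sys/vfs.h>

// Blacklist the hash table only when both paths report the same fsid.
// Any statfs failure is treated as "not the same" rather than an exception.
inline bool same_fsid(const char *path_a, const char *path_b) {
    struct statfs a{}, b{};
    if (statfs(path_a, &a) != 0 || statfs(path_b, &b) != 0) return false;
    return std::memcmp(&a.f_fsid, &b.f_fsid, sizeof a.f_fsid) == 0;
}
```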

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
925b12823e fs: add do_ioctl_nothrow and fsid methods to btrfs fs info
Enable use of the ioctl to probe whether two fds refer to the same btrfs,
without throwing an exception.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
561e604edc seeker: turn off debug logging
The debug log is only revealed when something goes wrong, but it is
created and discarded every time `seek_backward` is called, and it
is quite CPU-intensive.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
30cd375d03 readahead: clean up the code, update docs
Remove dubious comments and #if 0 section.  Document new event counters,
and add one for read failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
48b7fbda9c progress: adjust minimum thresholds for ETA to 10 seconds and 1 GiB of data
1% is a lot of data on a petabyte filesystem, and a long time to wait for an
ETA.

After 1 GiB we should have some idea of how fast we're reading the data.
Increase the time to 10 seconds to avoid a nonsense result just after a scan
starts.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
85aba7b695 openat2: #include <linux/types.h> so we can know __u64
Alternative implementations could use `uint64_t` instead, from `cstdint`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 17:02:19 -05:00
Zygo Blaxell
de38b46dd8 scripts/beesd: harden the mount options
 * `nodev`: This reduces rename attack surface by preventing bees from
 opening any device file on the target filesystem.

 * `noexec`: This prevents access to the mount point from being leveraged
 to execute setuid binaries, or execute anything at all through the
 mount point.

These options are not required because they duplicate features in the
bees binary (assuming that the mount namespace remains private):

 * `noatime`: bees always opens every file with `O_NOATIME`, making
 this option redundant.

 * `nosymfollow`: bees uses `openat2` on kernels 5.6 and later with
 flags that prevent symlink attacks.  `nosymfollow` was introduced in
 kernel 5.10, so every kernel that can do `nosymfollow` can already do
 `openat2`.  Also, historically, `$BEESHOME` can be a relative path with
 symlinks in any path component except the last one, and `nosymfollow`
 doesn't allow that.

Between `openat2` and `nodev`, all symlink attacks are prevented, and
rename attacks cannot be used to force bees to open a device file.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 01:00:41 -05:00
Zygo Blaxell
0abf6ebb3d scripts/beesd: no need for $BEESHOME to be a subvol
We _recommend_ that `$BEESHOME` be a subvol, and we'll create a subvol
if no directory exists; however, there's no reason to reject an existing
plain directory if the user chooses to use one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 00:43:13 -05:00
Kai Krakow
360ce7e125 scripts/beesd: Unshare namespace without systemd
If the beesd script is started without systemd, the mount point won't
be automatically unmounted when the script is cancelled with Ctrl+C.

Fixes: https://github.com/Zygo/bees/issues/281
Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-20 00:05:57 -05:00
Zygo Blaxell
ad11db2ee1 openat2: supply the missing definitions for building with old headers and new kernel
Apparently Ubuntu 20 has upgraded to kernel 5.15, but still builds things
with 5.4 headers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:20:06 -05:00
Zygo Blaxell
874832dc58 openat2: log a warning when we fall back to openat
This should occur only once per run, but it's worth leaving a note
that it has happened.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
Zygo Blaxell
5fe89d85c3 extent scan: make sure we run every extent crawler once per transaction
There's a pathological case where all of the extent scan crawlers except
one are at the end of a crawl cycle, but the one crawler that is still
running is keeping the Task queue full.  The result is that bees never
starts the other extent scan crawlers, because the queue is always
full at the instant a new transid triggers the start of a new scan.
That's bad because it will result in bees falling behind when new data
from the inactive size tiers appears.

To fix this, check for throttling _after_ creating at least one scan task
in each crawler.  That will keep the crawlers running, and possibly allow
them to claw back some space in the Task queue.  It slightly overcommits
the Task queue, so there will be a few more Tasks than nominally allowed.

Also (re)introduce some hysteresis in the queue size limit and reduce it
a little, so that bees isn't continually stopping and restarting crawls
every time one task is created or completed, and so that we stay under
the configured Task limit despite overcommitting.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
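A simplified model of the scheduling change described above (the `Crawler` and `Throttle` types here are illustrative, not bees's classes): every crawler queues at least one task before the throttle is consulted, and the throttle keeps separate high and low watermarks so it doesn't flap on each task created or completed:

```cpp
#include <cstddef>
#include <vector>

struct Crawler {
	size_t remaining = 0;      // extents left to queue
	size_t tasks_created = 0;  // tasks queued so far
};

// Hysteresis: engage above the high watermark, release only after the
// queue drains below the low watermark.
class Throttle {
	size_t m_high, m_low;
	bool m_throttled = false;
public:
	Throttle(size_t high, size_t low) : m_high(high), m_low(low) {}
	bool check(size_t queue_size) {
		if (queue_size > m_high) m_throttled = true;
		else if (queue_size < m_low) m_throttled = false;
		return m_throttled;
	}
};

void run_crawlers(std::vector<Crawler> &crawlers, Throttle &throttle, size_t &queue_size) {
	for (auto &c : crawlers) {
		while (c.remaining > 0) {
			// Queue at least one task per crawler before throttling,
			// slightly overcommitting the queue if necessary...
			--c.remaining;
			++c.tasks_created;
			++queue_size;
			// ...then pause this crawler once the throttle engages.
			// The next crawler still gets its turn, so none is starved.
			if (throttle.check(queue_size))
				break;
		}
	}
}
```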
Zygo Blaxell
a2b3e1e0c2 log: demote a lot of BEESLOGWARN to higher verbosity levels
Toxic extent workarounds are going away because the underlying kernel
bugs have been fixed.  They are no longer worthy of spamming non-developer
logs.

INO_PATHS can return no paths if an inode has been deleted.  It doesn't
need a log message at all, much less one at WARN level.

Dedupe failure can be INFO, the same level as dedupe itself, especially
since the "NO dedupe" message doesn't mention what was [not] deduped.

Inspired by Kai Krakow's "context: demote "abandoned toxic match" to
debug log level".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 01:08:28 -05:00
Kai Krakow
aaec931081 context: demote "abandoned toxic match" to debug log level
This log message creates an overwhelming number of messages in the
system journal, leading to write-back flushing storms under high
activity. As it is a workaround message, it is probably only useful to
developers, so demote it to debug level.

This fixes latency spikes in desktop usage after adding a lot of new
files, especially since systemd-journald starts to flush caches when it
sees memory pressure.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-19 00:59:22 -05:00
23 changed files with 1144 additions and 622 deletions

View File

@@ -55,6 +55,7 @@ These bugs are particularly popular among bees users, though not all are specifi
| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
| - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
| - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
| 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount. Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
| 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
| - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
| - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically

View File

@@ -120,13 +120,14 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
* `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
* `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
* `crawl_create`: A new subvol or extent crawler was created.
* `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
* `crawl_done`: One pass over a subvol was completed.
* `crawl_discard`: An extent that didn't match the crawler's size tier was discarded.
* `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
* `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
* `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
* `crawl_extent`: The extent crawler queued all references to an extent for processing.
* `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
* `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
* `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
* `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
* `crawl_hole`: An extent item in the search results refers to a hole.
@@ -138,6 +139,8 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
* `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
* `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
* `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
* `crawl_skip`: Small extent items were skipped because no extent of sufficient size was found within the minimum search distance.
* `crawl_skip_ms`: Time spent skipping small extent items.
* `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
* `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
* `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
@@ -281,11 +284,14 @@ The `progress` event group consists of events related to progress estimation.
readahead
---------
The `readahead` event group consists of events related to calls to `posix_fadvise`.
The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).
* `readahead_bytes`: Number of bytes prefetched.
* `readahead_count`: Number of read calls.
* `readahead_clear`: Number of times the duplicate read cache was cleared.
* `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
* `readahead_fail`: Number of read errors during prefetch.
* `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
* `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
* `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.
replacedst

View File

@@ -173,34 +173,42 @@ namespace crucible {
void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
};
/// Fetch extent items from extent tree
/// Fetch extent items from extent tree.
/// Does not filter out metadata! See BtrfsDataExtentTreeFetcher for that.
class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsExtentItemFetcher(const Fd &fd);
};
/// Fetch extent refs from an inode
/// Fetch extent refs from an inode. Caller must set the tree and objectid.
class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
public:
BtrfsExtentDataFetcher(const Fd &fd);
};
/// Fetch inodes from a subvol
class BtrfsFsTreeFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsFsTreeFetcher(const Fd &fd, uint64_t subvol);
};
/// Fetch raw inode items
class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsInodeFetcher(const Fd &fd);
BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
};
/// Fetch a root (subvol) item
class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsRootFetcher(const Fd &fd);
BtrfsTreeItem root(uint64_t subvol);
BtrfsTreeItem root_backref(uint64_t subvol);
};
/// Fetch data extent items from extent tree, skipping metadata-only block groups
class BtrfsDataExtentTreeFetcher : public BtrfsExtentItemFetcher {
BtrfsTreeItem m_current_bg;
BtrfsTreeOffsetFetcher m_chunk_tree;
protected:
virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr) override;
public:
BtrfsDataExtentTreeFetcher(const Fd &fd);
};
}

View File

@@ -78,9 +78,6 @@ enum btrfs_compression_type {
#define BTRFS_SHARED_BLOCK_REF_KEY 182
#define BTRFS_SHARED_DATA_REF_KEY 184
#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
#define BTRFS_FREE_SPACE_INFO_KEY 198
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
#define BTRFS_DEV_EXTENT_KEY 204
#define BTRFS_DEV_ITEM_KEY 216
#define BTRFS_CHUNK_ITEM_KEY 228
@@ -97,6 +94,18 @@ enum btrfs_compression_type {
#endif
#ifndef BTRFS_FREE_SPACE_INFO_KEY
#define BTRFS_FREE_SPACE_INFO_KEY 198
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
#define BTRFS_FREE_SPACE_OBJECTID -11ULL
#endif
#ifndef BTRFS_BLOCK_GROUP_RAID1C4
#define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
#define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
#endif
#ifndef BTRFS_DEFRAG_RANGE_START_IO
// For some reason uapi has BTRFS_DEFRAG_RANGE_COMPRESS and

View File

@@ -201,11 +201,13 @@ namespace crucible {
static thread_local size_t s_calls;
static thread_local size_t s_loops;
static thread_local size_t s_loops_empty;
static thread_local shared_ptr<ostream> s_debug_ostream;
};
ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);
string btrfs_chunk_type_ntoa(uint64_t type);
string btrfs_search_type_ntoa(unsigned type);
string btrfs_search_objectid_ntoa(uint64_t objectid);
string btrfs_compress_type_ntoa(uint8_t type);
@@ -246,9 +248,11 @@ namespace crucible {
struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args_v3 {
BtrfsIoctlFsInfoArgs();
void do_ioctl(int fd);
bool do_ioctl_nothrow(int fd);
uint16_t csum_type() const;
uint16_t csum_size() const;
uint64_t generation() const;
vector<uint8_t> fsid() const;
};
ostream & operator<<(ostream &os, const BtrfsIoctlFsInfoArgs &a);

View File

@@ -13,7 +13,7 @@ namespace crucible {
hexdump(ostream &os, const V &v)
{
const auto v_size = v.size();
const uint8_t* const v_data = reinterpret_cast<uint8_t*>(v.data());
const uint8_t* const v_data = reinterpret_cast<const uint8_t*>(v.data());
os << "V { size = " << v_size << ", data:\n";
for (size_t i = 0; i < v_size; i += 8) {
string hex, ascii;

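The one-line fix above adds the missing `const` to the cast target: `v.data()` returns a pointer to `const` when the container is `const`, so the old non-const cast failed to compile for const arguments. A minimal sketch of the corrected pattern (`hex_bytes` is an illustrative stand-in, not the real `hexdump`):

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// The container is taken by const reference, so the cast target must
// also be const-qualified: const uint8_t *, not uint8_t *.
template <class V>
std::string hex_bytes(const V &v) {
	const auto *data = reinterpret_cast<const uint8_t *>(v.data());
	std::ostringstream os;
	for (size_t i = 0; i < v.size(); ++i) {
		os << std::hex << std::setw(2) << std::setfill('0')
		   << unsigned(data[i]);
	}
	return os.str();
}
```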
View File

@@ -1,11 +1,46 @@
#ifndef CRUCIBLE_OPENAT2_H
#define CRUCIBLE_OPENAT2_H
#include <cstdlib>
// Compatibility for building on old libc for new kernel
#include <linux/version.h>
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
#include <linux/openat2.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>
#else
#include <linux/types.h>
#ifndef RESOLVE_NO_XDEV
#define RESOLVE_NO_XDEV 1
// RESOLVE_NO_XDEV was there from the beginning of openat2,
// so if that's missing, so is open_how
struct open_how {
__u64 flags;
__u64 mode;
__u64 resolve;
};
#endif
#ifndef RESOLVE_NO_MAGICLINKS
#define RESOLVE_NO_MAGICLINKS 2
#endif
#ifndef RESOLVE_NO_SYMLINKS
#define RESOLVE_NO_SYMLINKS 4
#endif
#ifndef RESOLVE_BENEATH
#define RESOLVE_BENEATH 8
#endif
#ifndef RESOLVE_IN_ROOT
#define RESOLVE_IN_ROOT 16
#endif
#endif // Linux version >= v5.6
extern "C" {

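With fallback definitions like the ones above in place, `openat2` can be invoked through `syscall(2)` on any libc. A hedged sketch (using `uint64_t` fields as the commit message suggests; the hard-coded syscall number is x86_64-only and purely illustrative):

```cpp
#include <cerrno>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Fallback definitions, as in the compatibility header above.
#ifndef RESOLVE_NO_SYMLINKS
struct open_how { uint64_t flags, mode, resolve; };
#define RESOLVE_NO_SYMLINKS 4
#endif
#ifndef SYS_openat2
#define SYS_openat2 437  // x86_64 value; illustrative only
#endif

// Open a path, refusing to resolve any symlink in any component.
// Returns an fd, or -1 with errno == ENOSYS on pre-5.6 kernels,
// at which point the caller can fall back to plain openat().
int open_no_symlinks(const char *path) {
	struct open_how how;
	memset(&how, 0, sizeof(how));  // mode must be 0 without O_CREAT
	how.flags = O_RDONLY;
	how.resolve = RESOLVE_NO_SYMLINKS;
	return syscall(SYS_openat2, AT_FDCWD, path, &how, sizeof(how));
}
```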
View File

@@ -6,23 +6,23 @@
#include <algorithm>
#include <limits>
#include <cstdint>
#if 1
// Debug stream
#include <memory>
#include <iostream>
#include <sstream>
#define DINIT(__x) __x
#define DLOG(__x) do { logs << __x << std::endl; } while (false)
#define DOUT(__err) do { __err << logs.str(); } while (false)
#else
#define DINIT(__x) do {} while (false)
#define DLOG(__x) do {} while (false)
#define DOUT(__x) do {} while (false)
#endif
#include <cstdint>
namespace crucible {
using namespace std;
extern thread_local shared_ptr<ostream> tl_seeker_debug_str;
#define SEEKER_DEBUG_LOG(__x) do { \
if (tl_seeker_debug_str) { \
(*tl_seeker_debug_str) << __x << "\n"; \
} \
} while (false)
// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
// - fetches objects in Pos order, starting from lower (must be >= lower)
// - must return upper if present, may or may not return objects after that
@@ -49,113 +49,108 @@ namespace crucible {
Pos
seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
{
DINIT(ostringstream logs);
try {
static const Pos end_pos = numeric_limits<Pos>::max();
// TBH this probably won't work if begin_pos != 0, i.e. any signed type
static const Pos begin_pos = numeric_limits<Pos>::min();
// Run a binary search looking for the highest key below target_pos.
// Initial upper bound of the search is target_pos.
// Find initial lower bound by doubling the size of the range until a key below target_pos
// is found, or the lower bound reaches the beginning of the search space.
// If the lower bound search reaches the beginning of the search space without finding a key,
// return the beginning of the search space; otherwise, perform a binary search between
// the bounds now established.
Pos lower_bound = 0;
Pos upper_bound = target_pos;
bool found_low = false;
Pos probe_pos = target_pos;
// We need one loop for each bit of the search space to find the lower bound,
// one loop for each bit of the search space to find the upper bound,
// and one extra loop to confirm the boundary is correct.
for (size_t loop_count = min(numeric_limits<Pos>::digits * size_t(2) + 1, max_loops); loop_count; --loop_count) {
DLOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
auto result = fetch(probe_pos, target_pos);
const Pos low_pos = result.empty() ? end_pos : *result.begin();
const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
DLOG(" = " << low_pos << ".." << high_pos);
// check for correct behavior of the fetch function
THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
if (!found_low) {
// if target_pos == end_pos then we will find it in every empty result set,
// so in that case we force the lower bound to be lower than end_pos
if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
// found a lower bound, set the low bound there and switch to binary search
found_low = true;
lower_bound = low_pos;
DLOG("found_low = true, lower_bound = " << lower_bound);
} else {
// still looking for lower bound
// if probe_pos was begin_pos then we can stop with no result
if (probe_pos == begin_pos) {
DLOG("return: probe_pos == begin_pos " << begin_pos);
return begin_pos;
}
// double the range size, or use the distance between objects found so far
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
// already checked low_pos <= high_pos above
const Pos want_delta = max(upper_bound - probe_pos, min_step);
// avoid underflowing the beginning of the search space
const Pos have_delta = min(want_delta, probe_pos - begin_pos);
THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
// move probe and try again
probe_pos = probe_pos - have_delta;
DLOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
continue;
static const Pos end_pos = numeric_limits<Pos>::max();
// TBH this probably won't work if begin_pos != 0, i.e. any signed type
static const Pos begin_pos = numeric_limits<Pos>::min();
// Run a binary search looking for the highest key below target_pos.
// Initial upper bound of the search is target_pos.
// Find initial lower bound by doubling the size of the range until a key below target_pos
// is found, or the lower bound reaches the beginning of the search space.
// If the lower bound search reaches the beginning of the search space without finding a key,
// return the beginning of the search space; otherwise, perform a binary search between
// the bounds now established.
Pos lower_bound = 0;
Pos upper_bound = target_pos;
bool found_low = false;
Pos probe_pos = target_pos;
// We need one loop for each bit of the search space to find the lower bound,
// one loop for each bit of the search space to find the upper bound,
// and one extra loop to confirm the boundary is correct.
for (size_t loop_count = min((1 + numeric_limits<Pos>::digits) * size_t(2), max_loops); loop_count; --loop_count) {
SEEKER_DEBUG_LOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
auto result = fetch(probe_pos, target_pos);
const Pos low_pos = result.empty() ? end_pos : *result.begin();
const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
SEEKER_DEBUG_LOG(" = " << low_pos << ".." << high_pos);
// check for correct behavior of the fetch function
THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
if (!found_low) {
// if target_pos == end_pos then we will find it in every empty result set,
// so in that case we force the lower bound to be lower than end_pos
if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
// found a lower bound, set the low bound there and switch to binary search
found_low = true;
lower_bound = low_pos;
SEEKER_DEBUG_LOG("found_low = true, lower_bound = " << lower_bound);
} else {
// still looking for lower bound
// if probe_pos was begin_pos then we can stop with no result
if (probe_pos == begin_pos) {
SEEKER_DEBUG_LOG("return: probe_pos == begin_pos " << begin_pos);
return begin_pos;
}
// double the range size, or use the distance between objects found so far
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
// already checked low_pos <= high_pos above
const Pos want_delta = max(upper_bound - probe_pos, min_step);
// avoid underflowing the beginning of the search space
const Pos have_delta = min(want_delta, probe_pos - begin_pos);
THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
// move probe and try again
probe_pos = probe_pos - have_delta;
SEEKER_DEBUG_LOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
continue;
}
if (low_pos <= target_pos && target_pos <= high_pos) {
// have keys on either side of target_pos in result
// search from the high end until we find the highest key below target
for (auto i = result.rbegin(); i != result.rend(); ++i) {
// more correctness checking for fetch
THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
if (*i <= target_pos) {
DLOG("return: *i " << *i << " <= target_pos " << target_pos);
return *i;
}
}
// if the list is empty then low_pos = high_pos = end_pos
// if target_pos = end_pos also, then we will execute the loop
// above but not find any matching entries.
THROW_CHECK0(runtime_error, result.empty());
}
if (target_pos <= low_pos) {
// results are all too high, so probe_pos..low_pos is too high
// lower the high bound to the probe pos
upper_bound = probe_pos;
DLOG("upper_bound = probe_pos " << probe_pos);
}
if (high_pos < target_pos) {
// results are all too low, so probe_pos..high_pos is too low
// raise the low bound to the high_pos
DLOG("lower_bound = high_pos " << high_pos);
lower_bound = high_pos;
}
// compute a new probe pos at the middle of the range and try again
// we can't have a zero-size range here because we would not have set found_low yet
THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
const Pos delta = (upper_bound - lower_bound) / 2;
probe_pos = lower_bound + delta;
if (delta < 1) {
// nothing can exist in the range (lower_bound, upper_bound)
// and an object is known to exist at lower_bound
DLOG("return: probe_pos == lower_bound " << lower_bound);
return lower_bound;
}
THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
DLOG("loop: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
}
THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
"found_low " << found_low);
} catch (...) {
DOUT(cerr);
throw;
if (low_pos <= target_pos && target_pos <= high_pos) {
// have keys on either side of target_pos in result
// search from the high end until we find the highest key below target
for (auto i = result.rbegin(); i != result.rend(); ++i) {
// more correctness checking for fetch
THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
if (*i <= target_pos) {
SEEKER_DEBUG_LOG("return: *i " << *i << " <= target_pos " << target_pos);
return *i;
}
}
// if the list is empty then low_pos = high_pos = end_pos
// if target_pos = end_pos also, then we will execute the loop
// above but not find any matching entries.
THROW_CHECK0(runtime_error, result.empty());
}
if (target_pos <= low_pos) {
// results are all too high, so probe_pos..low_pos is too high
// lower the high bound to the probe pos, low_pos cannot be lower
SEEKER_DEBUG_LOG("upper_bound = probe_pos " << probe_pos);
upper_bound = probe_pos;
}
if (high_pos < target_pos) {
// results are all too low, so probe_pos..high_pos is too low
// raise the low bound to high_pos but not above upper_bound
const auto next_pos = min(high_pos, upper_bound);
SEEKER_DEBUG_LOG("lower_bound = next_pos " << next_pos);
lower_bound = next_pos;
}
// compute a new probe pos at the middle of the range and try again
// we can't have a zero-size range here because we would not have set found_low yet
THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
const Pos delta = (upper_bound - lower_bound) / 2;
probe_pos = lower_bound + delta;
if (delta < 1) {
// nothing can exist in the range (lower_bound, upper_bound)
// and an object is known to exist at lower_bound
SEEKER_DEBUG_LOG("return: probe_pos == lower_bound " << lower_bound);
return lower_bound;
}
THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
SEEKER_DEBUG_LOG("loop bottom: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
}
THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
"found_low " << found_low);
}
}

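As a reading aid for `seek_backward` above: it returns the highest key at or below `target_pos`, or the beginning of the search space when no key qualifies. A naive linear-time reference (illustrative, not part of bees) states that contract compactly:

```cpp
#include <cstdint>
#include <set>

// Linear reference for seek_backward's contract over unsigned keys:
// highest key <= target, or 0 (begin_pos) when every key is above target.
uint64_t seek_backward_ref(const std::set<uint64_t> &keys, uint64_t target) {
	uint64_t best = 0;  // begin_pos == numeric_limits<uint64_t>::min()
	for (const auto k : keys) {
		if (k <= target && k >= best)
			best = k;
	}
	return best;
}
```

The binary search above computes the same answer, but with O(log n) calls to the fetch function instead of a full scan.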
View File

@@ -17,6 +17,7 @@ CRUCIBLE_OBJS = \
openat2.o \
path.o \
process.o \
seeker.o \
string.o \
table.o \
task.o \

View File

@@ -5,6 +5,12 @@
#include "crucible/hexdump.h"
#include "crucible/seeker.h"
#define CRUCIBLE_BTRFS_TREE_DEBUG(x) do { \
if (BtrfsIoctlSearchKey::s_debug_ostream) { \
(*BtrfsIoctlSearchKey::s_debug_ostream) << x; \
} \
} while (false)
namespace crucible {
using namespace std;
@@ -355,6 +361,7 @@ namespace crucible {
BtrfsTreeItem
BtrfsTreeFetcher::at(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("at " << logical);
BtrfsIoctlSearchKey &sk = m_sk;
fill_sk(sk, logical);
// Exact match, should return 0 or 1 items
@@ -397,53 +404,59 @@ namespace crucible {
BtrfsTreeFetcher::rlower_bound(uint64_t logical)
{
#if 0
#define BTFRLB_DEBUG(x) do { cerr << x; } while (false)
static bool btfrlb_debug = getenv("BTFLRB_DEBUG");
#define BTFRLB_DEBUG(x) do { if (btfrlb_debug) cerr << x; } while (false)
#else
#define BTFRLB_DEBUG(x) do { } while (false)
#define BTFRLB_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
#endif
BtrfsTreeItem closest_item;
uint64_t closest_logical = 0;
BtrfsIoctlSearchKey &sk = m_sk;
size_t loops = 0;
BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << endl);
seek_backward(scale_logical(logical), [&](uint64_t lower_bound, uint64_t upper_bound) {
BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << " in tree " << tree() << endl);
seek_backward(scale_logical(logical), [&](uint64_t const lower_bound, uint64_t const upper_bound) {
++loops;
fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
set<uint64_t> rv;
bool too_far = false;
do {
sk.nr_items = 4;
sk.do_ioctl(fd());
BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
for (auto &i : sk.m_result) {
next_sk(sk, i);
const auto this_logical = hdr_logical(i);
const auto scaled_hdr_logical = scale_logical(this_logical);
BTFRLB_DEBUG(" " << to_hex(scaled_hdr_logical));
if (hdr_match(i)) {
if (this_logical <= logical && this_logical > closest_logical) {
closest_logical = this_logical;
closest_item = i;
}
BTFRLB_DEBUG("(match)");
rv.insert(scaled_hdr_logical);
}
if (scaled_hdr_logical > upper_bound || hdr_stop(i)) {
if (scaled_hdr_logical >= upper_bound) {
BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
}
if (hdr_stop(i)) {
rv.insert(numeric_limits<uint64_t>::max());
BTFRLB_DEBUG("(stop)");
}
// If hdr_stop or !hdr_match, don't inspect the item
if (hdr_stop(i)) {
too_far = true;
rv.insert(numeric_limits<uint64_t>::max());
BTFRLB_DEBUG("(stop)");
break;
} else {
BTFRLB_DEBUG("(cont'd)");
}
if (!hdr_match(i)) {
BTFRLB_DEBUG("(no match)");
continue;
}
const auto this_logical = hdr_logical(i);
BTFRLB_DEBUG(" " << to_hex(this_logical) << " " << i);
const auto scaled_hdr_logical = scale_logical(this_logical);
BTFRLB_DEBUG(" " << "(match)");
if (scaled_hdr_logical > upper_bound) {
too_far = true;
BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
break;
}
if (this_logical <= logical && this_logical > closest_logical) {
closest_logical = this_logical;
closest_item = i;
BTFRLB_DEBUG("(closest)");
}
rv.insert(scaled_hdr_logical);
BTFRLB_DEBUG("(cont'd)");
}
BTFRLB_DEBUG(endl);
// We might get a search result that contains only non-matching items.
// Keep looping until we find any matching item or we run out of tree.
} while (rv.empty() && !sk.m_result.empty());
} while (!too_far && rv.empty() && !sk.m_result.empty());
return rv;
}, scale_logical(lookbehind_size()));
return closest_item;
@@ -474,6 +487,7 @@ namespace crucible {
BtrfsTreeItem
BtrfsTreeFetcher::next(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("next " << logical);
const auto scaled_logical = scale_logical(logical);
if (scaled_logical + 1 > scaled_max_logical()) {
return BtrfsTreeItem();
@@ -484,6 +498,7 @@ namespace crucible {
BtrfsTreeItem
BtrfsTreeFetcher::prev(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("prev " << logical);
const auto scaled_logical = scale_logical(logical);
if (scaled_logical < 1) {
return BtrfsTreeItem();
@@ -568,9 +583,10 @@ namespace crucible {
BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
{
#if 0
#define BCTFGS_DEBUG(x) do { cerr << x; } while (false)
static bool bctfgs_debug = getenv("BCTFGS_DEBUG");
#define BCTFGS_DEBUG(x) do { if (bctfgs_debug) cerr << x; } while (false)
#else
#define BCTFGS_DEBUG(x) do { } while (false)
#define BCTFGS_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
#endif
const uint64_t logical_end = logical + count * block_size();
BtrfsTreeItem bti = rlower_bound(logical);
@@ -662,14 +678,6 @@ namespace crucible {
type(BTRFS_EXTENT_DATA_KEY);
}
BtrfsFsTreeFetcher::BtrfsFsTreeFetcher(const Fd &new_fd, uint64_t subvol) :
BtrfsTreeObjectFetcher(new_fd)
{
tree(subvol);
type(BTRFS_EXTENT_DATA_KEY);
scale_size(1);
}
BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
BtrfsTreeObjectFetcher(fd)
{
@@ -693,18 +701,86 @@ namespace crucible {
BtrfsTreeObjectFetcher(fd)
{
tree(BTRFS_ROOT_TREE_OBJECTID);
type(BTRFS_ROOT_ITEM_KEY);
scale_size(1);
}
BtrfsTreeItem
BtrfsRootFetcher::root(uint64_t subvol)
BtrfsRootFetcher::root(const uint64_t subvol)
{
const auto my_type = BTRFS_ROOT_ITEM_KEY;
type(my_type);
const auto item = at(subvol);
if (!!item) {
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
THROW_CHECK2(runtime_error, item.type(), BTRFS_ROOT_ITEM_KEY, item.type() == BTRFS_ROOT_ITEM_KEY);
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
}
return item;
}
BtrfsTreeItem
BtrfsRootFetcher::root_backref(const uint64_t subvol)
{
const auto my_type = BTRFS_ROOT_BACKREF_KEY;
type(my_type);
const auto item = at(subvol);
if (!!item) {
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
}
return item;
}
BtrfsDataExtentTreeFetcher::BtrfsDataExtentTreeFetcher(const Fd &fd) :
BtrfsExtentItemFetcher(fd),
m_chunk_tree(fd)
{
tree(BTRFS_EXTENT_TREE_OBJECTID);
type(BTRFS_EXTENT_ITEM_KEY);
m_chunk_tree.tree(BTRFS_CHUNK_TREE_OBJECTID);
m_chunk_tree.type(BTRFS_CHUNK_ITEM_KEY);
m_chunk_tree.objectid(BTRFS_FIRST_CHUNK_TREE_OBJECTID);
}
void
BtrfsDataExtentTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
{
key.min_type = key.max_type = type();
key.max_objectid = key.max_offset = numeric_limits<uint64_t>::max();
key.min_offset = 0;
key.min_objectid = hdr.objectid;
const auto step = scale_size();
if (key.min_objectid < numeric_limits<uint64_t>::max() - step) {
key.min_objectid += step;
} else {
key.min_objectid = numeric_limits<uint64_t>::max();
}
// If we're still in our current block group, check here
if (!!m_current_bg) {
const auto bg_begin = m_current_bg.offset();
const auto bg_end = bg_begin + m_current_bg.chunk_length();
// If we are still in our current block group, return early
if (key.min_objectid >= bg_begin && key.min_objectid < bg_end) return;
}
// We don't have a current block group or we're out of range
// Find the chunk that this bytenr belongs to
m_current_bg = m_chunk_tree.rlower_bound(key.min_objectid);
// Make sure it's a data block group
while (!!m_current_bg) {
// Data block group, stop here
if (m_current_bg.chunk_type() & BTRFS_BLOCK_GROUP_DATA) break;
// Not a data block group, skip to end
key.min_objectid = m_current_bg.offset() + m_current_bg.chunk_length();
m_current_bg = m_chunk_tree.lower_bound(key.min_objectid);
}
if (!m_current_bg) {
// Ran out of data block groups, stop here
return;
}
// Check to see if bytenr is in the current data block group
const auto bg_begin = m_current_bg.offset();
if (key.min_objectid < bg_begin) {
// Move forward to start of data block group
key.min_objectid = bg_begin;
}
}
}


@@ -757,6 +757,7 @@ namespace crucible {
thread_local size_t BtrfsIoctlSearchKey::s_calls = 0;
thread_local size_t BtrfsIoctlSearchKey::s_loops = 0;
thread_local size_t BtrfsIoctlSearchKey::s_loops_empty = 0;
thread_local shared_ptr<ostream> BtrfsIoctlSearchKey::s_debug_ostream;
bool
BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
@@ -776,6 +777,9 @@ namespace crucible {
ioctl_ptr = ioctl_arg.get<btrfs_ioctl_search_args_v2>();
ioctl_ptr->key = static_cast<const btrfs_ioctl_search_key&>(*this);
ioctl_ptr->buf_size = buf_size;
if (s_debug_ostream) {
(*s_debug_ostream) << "bisk " << (ioctl_ptr->key) << "\n";
}
// Don't bother supporting V1. Kernels that old have other problems.
int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_arg.data());
++s_calls;
@@ -881,6 +885,26 @@ namespace crucible {
}
}
string
btrfs_chunk_type_ntoa(uint64_t type)
{
static const bits_ntoa_table table[] = {
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DATA),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_METADATA),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_SYSTEM),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DUP),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID0),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID10),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C3),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C4),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID5),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID6),
NTOA_TABLE_ENTRY_END()
};
return bits_ntoa(type, table);
}
string
btrfs_search_type_ntoa(unsigned type)
{
@@ -908,15 +932,9 @@ namespace crucible {
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_BLOCK_REF_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_DATA_REF_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_BLOCK_GROUP_ITEM_KEY),
#ifdef BTRFS_FREE_SPACE_INFO_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_INFO_KEY),
#endif
#ifdef BTRFS_FREE_SPACE_EXTENT_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_EXTENT_KEY),
#endif
#ifdef BTRFS_FREE_SPACE_BITMAP_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_BITMAP_KEY),
#endif
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_EXTENT_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_ITEM_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_CHUNK_ITEM_KEY),
@@ -948,9 +966,7 @@ namespace crucible {
NTOA_TABLE_ENTRY_ENUM(BTRFS_CSUM_TREE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_QUOTA_TREE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_UUID_TREE_OBJECTID),
#ifdef BTRFS_FREE_SPACE_TREE_OBJECTID
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_TREE_OBJECTID),
#endif
NTOA_TABLE_ENTRY_ENUM(BTRFS_BALANCE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_ORPHAN_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_TREE_LOG_OBJECTID),
@@ -1138,11 +1154,17 @@ namespace crucible {
{
}
void
BtrfsIoctlFsInfoArgs::do_ioctl(int fd)
bool
BtrfsIoctlFsInfoArgs::do_ioctl_nothrow(int const fd)
{
btrfs_ioctl_fs_info_args_v3 *p = static_cast<btrfs_ioctl_fs_info_args_v3 *>(this);
if (ioctl(fd, BTRFS_IOC_FS_INFO, p)) {
return 0 == ioctl(fd, BTRFS_IOC_FS_INFO, p);
}
void
BtrfsIoctlFsInfoArgs::do_ioctl(int const fd)
{
if (!do_ioctl_nothrow(fd)) {
THROW_ERRNO("BTRFS_IOC_FS_INFO: fd " << fd);
}
}
@@ -1159,6 +1181,13 @@ namespace crucible {
return this->btrfs_ioctl_fs_info_args_v3::csum_size;
}
vector<uint8_t>
BtrfsIoctlFsInfoArgs::fsid() const
{
const auto begin = btrfs_ioctl_fs_info_args_v3::fsid;
return vector<uint8_t>(begin, begin + BTRFS_FSID_SIZE);
}
uint64_t
BtrfsIoctlFsInfoArgs::generation() const
{


@@ -1,5 +1,27 @@
#include "crucible/openat2.h"
#include <sys/syscall.h>
// Compatibility for building on old libc for new kernel
#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 6, 0)
// Every arch that defines this uses 437, except Alpha, where 437 is
// mq_getsetattr.
#ifndef SYS_openat2
#ifdef __alpha__
#define SYS_openat2 547
#else
#define SYS_openat2 437
#endif
#endif
#endif // Linux version >= v5.6
#include <fcntl.h>
#include <unistd.h>
extern "C" {
int
@@ -7,7 +29,12 @@ __attribute__((weak))
openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
throw()
{
#ifdef SYS_openat2
return syscall(SYS_openat2, dirfd, pathname, how, size);
#else
errno = ENOSYS;
return -1;
#endif
}
};

lib/seeker.cc (new file)

@@ -0,0 +1,7 @@
#include "crucible/seeker.h"
namespace crucible {
thread_local shared_ptr<ostream> tl_seeker_debug_str;
};


@@ -1,5 +1,13 @@
#!/bin/bash
# if not called from systemd try to replicate mount unsharing on ctrl+c
# see: https://github.com/Zygo/bees/issues/281
if [ -z "${SYSTEMD_EXEC_PID}" -a -z "${UNSHARE_DONE}" ]; then
UNSHARE_DONE=true
export UNSHARE_DONE
exec unshare -m --propagation private -- "$0" "$@"
fi
## Helpful functions
INFO(){ echo "INFO:" "$@"; }
ERRO(){ echo "ERROR:" "$@"; exit 1; }
@@ -108,13 +116,11 @@ mkdir -p "$WORK_DIR" || exit 1
INFO "MOUNT DIR: $MNT_DIR"
mkdir -p "$MNT_DIR" || exit 1
mount --make-private -osubvolid=5 /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
mount --make-private -osubvolid=5,nodev,noexec /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
if [ ! -d "$BEESHOME" ]; then
INFO "Create subvol $BEESHOME for store bees data"
btrfs sub cre "$BEESHOME"
else
btrfs sub show "$BEESHOME" &> /dev/null || ERRO "$BEESHOME MUST BE A SUBVOL!"
fi
# Check DB size


@@ -17,6 +17,7 @@ KillSignal=SIGTERM
MemoryAccounting=true
Nice=19
Restart=on-abnormal
RuntimeDirectoryMode=0700
RuntimeDirectory=bees
StartupCPUWeight=25
StartupIOWeight=25


@@ -230,8 +230,10 @@ BeesContext::dedup(const BeesRangePair &brp_in)
BeesAddress first_addr(brp.first.fd(), brp.first.begin());
BeesAddress second_addr(brp.second.fd(), brp.second.begin());
if (first_addr.get_physical_or_zero() == second_addr.get_physical_or_zero()) {
BEESLOGTRACE("equal physical addresses in dedup");
const auto first_gpoz = first_addr.get_physical_or_zero();
const auto second_gpoz = second_addr.get_physical_or_zero();
if (first_gpoz == second_gpoz) {
BEESLOGDEBUG("equal physical addresses " << first_addr << " and " << second_addr << " in dedup");
BEESCOUNT(bug_dedup_same_physical);
}
@@ -259,7 +261,7 @@ BeesContext::dedup(const BeesRangePair &brp_in)
BEESCOUNTADD(dedup_bytes, brp.first.size());
} else {
BEESCOUNT(dedup_miss);
BEESLOGWARN("NO Dedup! " << brp);
BEESLOGINFO("NO Dedup! " << brp);
}
lock.reset();
@@ -373,7 +375,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
Extent::OBSCURED | Extent::PREALLOC
)) {
BEESCOUNT(scan_interesting);
BEESLOGWARN("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
BEESLOGINFO("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
}
if (e.flags() & Extent::HOLE) {
@@ -385,7 +387,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
if (e.flags() & Extent::PREALLOC) {
// Prealloc is all zero and we replace it with a hole.
// No special handling is required here. Nuke it and move on.
BEESLOGINFO("prealloc extent " << e);
BEESLOGINFO("prealloc extent " << e << " in " << bfr);
// Must not extend past EOF
auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
// Must hold tmpfile until dedupe is done
@@ -534,7 +536,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
// Hash is toxic
if (found_addr.is_toxic()) {
BEESLOGWARN("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
BEESLOGDEBUG("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
// Don't push these back in because we'll never delete them.
// Extents may become non-toxic so give them a chance to expire.
// hash_table->push_front_hash_addr(hash, found_addr);
@@ -556,7 +558,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
BeesResolver resolved(m_ctx, found_addr);
// Toxic extents are really toxic
if (resolved.is_toxic()) {
BEESLOGWARN("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
BEESLOGDEBUG("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
BEESCOUNT(scan_toxic_match);
// Make sure we never see this hash again.
// It has become toxic since it was inserted into the hash table.
@@ -917,7 +919,7 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
// Sanity check
if (bfr.begin() >= bfr.file_size()) {
BEESLOGWARN("past EOF: " << bfr);
BEESLOGDEBUG("past EOF: " << bfr);
BEESCOUNT(scanf_eof);
return false;
}


@@ -797,7 +797,7 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
for (auto fp = madv_flags; fp->value; ++fp) {
BEESTOOLONG("madvise(" << fp->name << ")");
if (madvise(m_byte_ptr, m_size, fp->value)) {
BEESLOGWARN("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
BEESLOGNOTICE("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
}
}
@@ -811,8 +811,19 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
prefetch_loop();
});
// Blacklist might fail if the hash table is not stored on a btrfs
// Blacklist might fail if the hash table is not stored on a btrfs,
// or if it's on a _different_ btrfs
catch_all([&]() {
// Root is definitely a btrfs
BtrfsIoctlFsInfoArgs root_info;
root_info.do_ioctl(m_ctx->root_fd());
// Hash might not be a btrfs
BtrfsIoctlFsInfoArgs hash_info;
// If btrfs fs_info ioctl fails, it must be a different fs
if (!hash_info.do_ioctl_nothrow(m_fd)) return;
// If Hash is a btrfs, Root must be the same one
if (root_info.fsid() != hash_info.fsid()) return;
// Hash is on the same one, blacklist it
m_ctx->blacklist_insert(BeesFileId(m_fd));
});
}

File diff suppressed because it is too large.


@@ -8,38 +8,32 @@ thread_local BeesTracer *BeesTracer::tl_next_tracer = nullptr;
thread_local bool BeesTracer::tl_first = true;
thread_local bool BeesTracer::tl_silent = false;
bool
exception_check()
{
#if __cplusplus >= 201703
static
bool
exception_check()
{
return uncaught_exceptions();
}
#else
static
bool
exception_check()
{
return uncaught_exception();
}
#endif
}
BeesTracer::~BeesTracer()
{
if (!tl_silent && exception_check()) {
if (tl_first) {
BEESLOGNOTICE("--- BEGIN TRACE --- exception ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE --- exception ---");
tl_first = false;
}
try {
m_func();
} catch (exception &e) {
BEESLOGNOTICE("Nested exception: " << e.what());
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception: " << e.what());
} catch (...) {
BEESLOGNOTICE("Nested exception ...");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception ...");
}
if (!m_next_tracer) {
BEESLOGNOTICE("--- END TRACE --- exception ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE --- exception ---");
}
}
tl_next_tracer = m_next_tracer;
@@ -49,7 +43,7 @@ BeesTracer::~BeesTracer()
}
}
BeesTracer::BeesTracer(function<void()> f, bool silent) :
BeesTracer::BeesTracer(const function<void()> &f, bool silent) :
m_func(f)
{
m_next_tracer = tl_next_tracer;
@@ -61,12 +55,12 @@ void
BeesTracer::trace_now()
{
BeesTracer *tp = tl_next_tracer;
BEESLOGNOTICE("--- BEGIN TRACE ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE ---");
while (tp) {
tp->m_func();
tp = tp->m_next_tracer;
}
BEESLOGNOTICE("--- END TRACE ---");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE ---");
}
bool


@@ -457,7 +457,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
}
if (found_toxic) {
BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
BEESCOUNT(pairbackward_toxic_hash);
break;
}
@@ -558,7 +558,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
}
if (found_toxic) {
BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
BEESCOUNT(pairforward_toxic_hash);
break;
}
@@ -572,7 +572,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
if (first.overlaps(second)) {
BEESLOGTRACE("after grow, first " << first << "\n\toverlaps " << second);
BEESLOGDEBUG("after grow, first " << first << "\n\toverlaps " << second);
BEESCOUNT(bug_grow_pair_overlaps);
}
@@ -674,7 +674,7 @@ BeesAddress::magic_check(uint64_t flags)
static const unsigned recognized_flags = compressed_flags | delalloc_flags | ignore_flags | unusable_flags;
if (flags & ~recognized_flags) {
BEESLOGTRACE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
BEESLOGNOTICE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
m_addr = UNUSABLE;
// maybe we throw here?
BEESCOUNT(addr_unrecognized);


@@ -4,6 +4,7 @@
#include "crucible/process.h"
#include "crucible/string.h"
#include "crucible/task.h"
#include "crucible/uname.h"
#include <cctype>
#include <cmath>
@@ -11,17 +12,19 @@
#include <iostream>
#include <memory>
#include <regex>
#include <sstream>
// PRIx64
#include <inttypes.h>
#include <sched.h>
#include <sys/fanotify.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
// statfs
#include <linux/magic.h>
#include <sys/statfs.h>
// setrlimit
#include <sys/time.h>
#include <sys/resource.h>
@@ -198,7 +201,7 @@ BeesTooLong::check() const
if (age() > m_limit) {
ostringstream oss;
m_func(oss);
BEESLOGWARN("PERFORMANCE: " << *this << " sec: " << oss.str());
BEESLOGINFO("PERFORMANCE: " << *this << " sec: " << oss.str());
}
}
@@ -246,10 +249,6 @@ bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
Timer readahead_timer;
BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
#if 0
// In the kernel, readahead() is identical to posix_fadvise(..., POSIX_FADV_DONTNEED)
DIE_IF_NON_ZERO(readahead(fd, offset, size));
#else
// Make sure this data is in page cache by brute force
// The btrfs kernel code does readahead with lower ioprio
// and might discard the readahead request entirely.
@@ -263,13 +262,16 @@ bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
// Ignore errors and short reads. It turns out our size
// parameter isn't all that accurate, so we can't use
// the pread_or_die template.
(void)!pread(fd, dummy, this_read_size, working_offset);
BEESCOUNT(readahead_count);
BEESCOUNTADD(readahead_bytes, this_read_size);
const auto pr_rv = pread(fd, dummy, this_read_size, working_offset);
if (pr_rv >= 0) {
BEESCOUNT(readahead_count);
BEESCOUNTADD(readahead_bytes, pr_rv);
} else {
BEESCOUNT(readahead_fail);
}
working_offset += this_read_size;
working_size -= this_read_size;
}
#endif
BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
}
@@ -392,6 +394,73 @@ BeesStringFile::read()
return read_string(fd, st.st_size);
}
static
void
bees_fsync(int const fd)
{
// Note that when btrfs renames a temporary over an existing file,
// it flushes the temporary, so we get the right behavior if we
// just do nothing here (except when the file is first created;
// however, in that case the result is the same as if the file
// did not exist, was empty, or was filled with garbage).
//
// Kernel versions prior to 5.16 had bugs which would put ghost
// dirents in $BEESHOME if there was a crash when we called
// fsync() here.
//
// Some other filesystems will throw our data away if we don't
// call fsync, so we do need to call fsync() on those filesystems.
//
// Newer btrfs kernel versions rely on fsync() to report
// unrecoverable write errors. If we don't check the fsync()
// result, we'll lose the data when we rename(). Kernel 6.2 added
// a number of new root causes for the class of "unrecoverable
// write errors" so we need to check this now.
BEESNOTE("checking filesystem type for " << name_fd(fd));
// LSB deprecated statfs without providing a replacement that
// can fill in the f_type field.
struct statfs stf = { 0 };
DIE_IF_NON_ZERO(fstatfs(fd, &stf));
if (stf.f_type != BTRFS_SUPER_MAGIC) {
BEESLOGONCE("Using fsync on non-btrfs filesystem type " << to_hex(stf.f_type));
BEESNOTE("fsync non-btrfs " << name_fd(fd));
DIE_IF_NON_ZERO(fsync(fd));
return;
}
static bool did_uname = false;
static bool do_fsync = false;
if (!did_uname) {
Uname uname;
const string version(uname.release);
static const regex version_re(R"/(^(\d+)\.(\d+)\.)/", regex::optimize | regex::ECMAScript);
smatch m;
// Last known bug in the fsync-rename use case was fixed in kernel 5.16
static const auto min_major = 5, min_minor = 16;
if (regex_search(version, m, version_re)) {
const auto major = stoul(m[1]);
const auto minor = stoul(m[2]);
if (tie(major, minor) > tie(min_major, min_minor)) {
BEESLOGONCE("Using fsync on btrfs because kernel version is " << major << "." << minor);
do_fsync = true;
} else {
BEESLOGONCE("Not using fsync on btrfs because kernel version is " << major << "." << minor);
}
} else {
BEESLOGONCE("Not using fsync on btrfs because can't parse kernel version '" << version << "'");
}
did_uname = true;
}
if (do_fsync) {
BEESNOTE("fsync btrfs " << name_fd(fd));
DIE_IF_NON_ZERO(fsync(fd));
}
}
void
BeesStringFile::write(string contents)
{
@@ -407,19 +476,8 @@ BeesStringFile::write(string contents)
Fd ofd = openat_or_die(m_dir_fd, tmpname, FLAGS_CREATE_FILE, S_IRUSR | S_IWUSR);
BEESNOTE("writing " << tmpname << " in " << name_fd(m_dir_fd));
write_or_die(ofd, contents);
#if 0
// This triggers too many btrfs bugs. I wish I was kidding.
// Forget snapshots, balance, compression, and dedupe:
// the system call you have to fear on btrfs is fsync().
// Also note that when bees renames a temporary over an
// existing file, it flushes the temporary, so we get
// the right behavior if we just do nothing here
// (except when the file is first created; however,
// in that case the result is the same as if the file
// did not exist, was empty, or was filled with garbage).
BEESNOTE("fsyncing " << tmpname << " in " << name_fd(m_dir_fd));
DIE_IF_NON_ZERO(fsync(ofd));
#endif
bees_fsync(ofd);
}
BEESNOTE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
BEESTRACE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
@@ -444,6 +502,19 @@ BeesTempFile::resize(off_t offset)
// Count time spent here
BEESCOUNTADD(tmp_resize_ms, resize_timer.age() * 1000);
// Modify flags - every time
// - btrfs will keep trying to set FS_NOCOMP_FL behind us when compression heuristics identify
// the data as compressible, but it fails to compress
// - clear FS_NOCOW_FL because we can only dedupe between files with the same FS_NOCOW_FL state,
// and we don't open FS_NOCOW_FL files for dedupe.
BEESTRACE("Getting FS_COMPR_FL and FS_NOCOMP_FL on m_fd " << name_fd(m_fd));
int flags = ioctl_iflags_get(m_fd);
flags |= FS_COMPR_FL;
flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
BEESTRACE("Setting FS_COMPR_FL and clearing FS_NOCOMP_FL | FS_NOCOW_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
ioctl_iflags_set(m_fd, flags);
// That may have queued some delayed ref deletes, so throttle them
bees_throttle(resize_timer.age(), "tmpfile_resize");
}
@@ -485,13 +556,6 @@ BeesTempFile::BeesTempFile(shared_ptr<BeesContext> ctx) :
// Add this file to open_root_ino lookup table
m_roots->insert_tmpfile(m_fd);
// Set compression attribute
BEESTRACE("Getting FS_COMPR_FL on m_fd " << name_fd(m_fd));
int flags = ioctl_iflags_get(m_fd);
flags |= FS_COMPR_FL;
BEESTRACE("Setting FS_COMPR_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
ioctl_iflags_set(m_fd, flags);
// Count time spent here
BEESCOUNTADD(tmp_create_ms, create_timer.age() * 1000);
@@ -683,7 +747,7 @@ bees_main(int argc, char *argv[])
BEESLOGDEBUG("exception (ignored): " << s);
BEESCOUNT(exception_caught_silent);
} else {
BEESLOGNOTICE("\n\n*** EXCEPTION ***\n\t" << s << "\n***\n");
BEESLOG(BEES_TRACE_LEVEL, "TRACE: EXCEPTION: " << s);
BEESCOUNT(exception_caught);
}
});
@@ -704,9 +768,8 @@ bees_main(int argc, char *argv[])
shared_ptr<BeesContext> bc = make_shared<BeesContext>();
BEESLOGDEBUG("context constructed");
string cwd(readlink_or_die("/proc/self/cwd"));
// Defaults
bool use_relative_paths = false;
bool chatter_prefix_timestamp = true;
double thread_factor = 0;
unsigned thread_count = 0;
@@ -778,7 +841,7 @@ bees_main(int argc, char *argv[])
thread_min = stoul(optarg);
break;
case 'P':
crucible::set_relative_path(cwd);
use_relative_paths = true;
break;
case 'T':
chatter_prefix_timestamp = false;
@@ -796,7 +859,7 @@ bees_main(int argc, char *argv[])
root_scan_mode = static_cast<BeesRoots::ScanMode>(stoul(optarg));
break;
case 'p':
crucible::set_relative_path("");
use_relative_paths = false;
break;
case 't':
chatter_prefix_timestamp = true;
@@ -866,18 +929,19 @@ bees_main(int argc, char *argv[])
BEESLOGNOTICE("setting root path to '" << root_path << "'");
bc->set_root_path(root_path);
// Set path prefix
if (use_relative_paths) {
crucible::set_relative_path(name_fd(bc->root_fd()));
}
// Workaround for btrfs send
bc->roots()->set_workaround_btrfs_send(workaround_btrfs_send);
// Set root scan mode
bc->roots()->set_scan_mode(root_scan_mode);
if (root_scan_mode == BeesRoots::SCAN_MODE_EXTENT) {
MultiLocker::enable_locking(false);
} else {
// Workaround for a kernel bug that the subvol-based crawlers keep triggering
MultiLocker::enable_locking(true);
}
// Workaround for the logical-ino-vs-clone kernel bug
MultiLocker::enable_locking(true);
// Start crawlers
bc->start();


@@ -122,9 +122,9 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
// macros ----------------------------------------
#define BEESLOG(lv,x) do { if (lv < bees_log_level) { Chatter __chatter(lv, BeesNote::get_name()); __chatter << x; } } while (0)
#define BEESLOGTRACE(x) do { BEESLOG(LOG_DEBUG, x); BeesTracer::trace_now(); } while (0)
#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(LOG_ERR, x << " at " << __FILE__ << ":" << __LINE__); })
#define BEES_TRACE_LEVEL LOG_DEBUG
#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(BEES_TRACE_LEVEL, "TRACE: " << x << " at " << __FILE__ << ":" << __LINE__); })
#define BEESTOOLONG(x) BeesTooLong SRSLY_WTF_C(beesTooLong_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
#define BEESNOTE(x) BeesNote SRSLY_WTF_C(beesNote_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
@@ -134,6 +134,14 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
#define BEESLOGINFO(x) BEESLOG(LOG_INFO, x)
#define BEESLOGDEBUG(x) BEESLOG(LOG_DEBUG, x)
#define BEESLOGONCE(__x) do { \
static bool already_logged = false; \
if (!already_logged) { \
already_logged = true; \
BEESLOGNOTICE(__x); \
} \
} while (false)
#define BEESCOUNT(stat) do { \
BeesStats::s_global.add_count(#stat); \
} while (0)
@@ -185,7 +193,7 @@ class BeesTracer {
thread_local static bool tl_silent;
thread_local static bool tl_first;
public:
BeesTracer(function<void()> f, bool silent = false);
BeesTracer(const function<void()> &f, bool silent = false);
~BeesTracer();
static void trace_now();
static bool get_silent();
@@ -521,7 +529,7 @@ class BeesCrawl {
bool fetch_extents();
void fetch_extents_harder();
bool restart_crawl();
bool restart_crawl_unlocked();
BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;
public:
@@ -535,6 +543,7 @@ public:
void deferred(bool def_setting);
bool deferred() const;
bool finished() const;
bool restart_crawl();
};
class BeesScanMode;
@@ -543,7 +552,8 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
shared_ptr<BeesContext> m_ctx;
BeesStringFile m_crawl_state_file;
map<uint64_t, shared_ptr<BeesCrawl>> m_root_crawl_map;
using CrawlMap = map<uint64_t, shared_ptr<BeesCrawl>>;
CrawlMap m_root_crawl_map;
mutex m_mutex;
uint64_t m_crawl_dirty = 0;
uint64_t m_crawl_clean = 0;
@@ -562,7 +572,7 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
condition_variable m_stop_condvar;
bool m_stop_requested = false;
void insert_new_crawl();
CrawlMap insert_new_crawl();
Fd open_root_nocache(uint64_t root);
Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
uint64_t transid_max_nocache();
@@ -578,13 +588,14 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
void current_state_set(const BeesCrawlState &bcs);
bool crawl_batch(shared_ptr<BeesCrawl> crawl);
void clear_caches();
friend class BeesScanModeExtent;
shared_ptr<BeesCrawl> insert_root(const BeesCrawlState &bcs);
bool up_to_date(const BeesCrawlState &bcs);
friend class BeesCrawl;
friend class BeesFdCache;
friend class BeesScanMode;
friend class BeesScanModeSubvol;
friend class BeesScanModeExtent;
public:
BeesRoots(shared_ptr<BeesContext> ctx);
@@ -890,5 +901,6 @@ void bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offse
void bees_unreadahead(int fd, off_t offset, size_t size);
void bees_throttle(double time_used, const char *context);
string format_time(time_t t);
bool exception_check();
#endif


@@ -19,7 +19,9 @@ seeker_finder(const vector<uint64_t> &vec, uint64_t lower, uint64_t upper)
if (ub != s.end()) ++ub;
if (ub != s.end()) ++ub;
for (; ub != s.end(); ++ub) {
if (*ub > upper) break;
if (*ub > upper) {
break;
}
}
return set<uint64_t>(lb, ub);
}
@@ -28,7 +30,7 @@ static bool test_fails = false;
static
void
seeker_test(const vector<uint64_t> &vec, uint64_t const target)
seeker_test(const vector<uint64_t> &vec, uint64_t const target, bool const always_out = false)
{
cerr << "Find " << target << " in {";
for (auto i : vec) {
@@ -36,11 +38,13 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
}
cerr << " } = ";
size_t loops = 0;
tl_seeker_debug_str = make_shared<ostringstream>();
bool local_test_fails = false;
bool excepted = catch_all([&]() {
auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
const auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
++loops;
return seeker_finder(vec, lower, upper);
});
}, uint64_t(32));
cerr << found;
uint64_t my_found = 0;
for (auto i : vec) {
@@ -52,13 +56,15 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
cerr << " (correct)";
} else {
cerr << " (INCORRECT - right answer is " << my_found << ")";
test_fails = true;
local_test_fails = true;
}
});
cerr << " (" << loops << " loops)" << endl;
if (excepted) {
test_fails = true;
if (excepted || local_test_fails || always_out) {
cerr << dynamic_pointer_cast<ostringstream>(tl_seeker_debug_str)->str();
}
test_fails = test_fails || local_test_fails;
tl_seeker_debug_str.reset();
}
static
@@ -89,6 +95,39 @@ test_seeker()
seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max());
seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max() - 1);
seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() - 1 }, numeric_limits<uint64_t>::max());
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 0);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 1);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 2);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 3);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 4);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 5);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 6);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 7);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 8);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 9);
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 1 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 2 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 3 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 4 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 5 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 6 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 7 );
seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 8 );
// Pulled from a bees debug log
seeker_test(vector<uint64_t> {
6821962845,
6821962848,
6821963411,
6821963422,
6821963536,
6821963539,
6821963835, // <- appeared during the search, causing an exception
6821963841,
6822575316,
}, 6821971036, true);
}