mirror of https://github.com/Zygo/bees.git synced 2025-08-09 17:13:47 +02:00

Zygo Blaxell 183b6a5361 extent scan: refactor BeesCrawl, BeesScanMode*

The main gains here are:

* Move extent tree searches into BeesScanModeExtent so that they are
not slowed down by the BeesCrawl code, which was designed for the
much more specialized metadata in subvol trees.
* Enable short extent skipping now that BeesCrawl is out of the way.
* Stop enumerating btrfs subvols when in extent scan mode.

All this gets rid of >99% of unnecessary extent tree searches.
Incremental extent scan cycles now finish in milliseconds instead
of minutes.

BeesCrawl was never designed to cope with the structure and content of
the extent tree.  It would waste thousands of tree-search ioctl calls
reading and ignoring metadata items.

Performance was particularly bad when a binary search was involved, as any
binary search probe that landed in a metadata block group would read and
discard all the metadata items in the block group, sequentially, repeated
for each level of the binary search.  This was blocking implementation of
the short extent skipping optimization for large extent size tiers, because
the skips were using thousands of tree searches to skip over only a few
hundred extent items.
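
The cost difference described above can be illustrated with a toy model.
This is only a sketch: `Item`, `BlockGroup`, `probe_naive`, and
`probe_skipping` are hypothetical names, not bees or crucible types, and a
sorted vector stands in for the extent tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model, not the bees data structures.
struct Item { uint64_t addr; bool is_metadata; };
// Half-open address range [start, end) holding only one kind of item.
struct BlockGroup { uint64_t start; uint64_t end; bool is_metadata; };
// Where the probe ended up and how many items it had to inspect.
struct ProbeResult { size_t index; size_t items_read; };

// Index of the first item at or after addr (stands in for one tree search).
static size_t first_at_or_after(const std::vector<Item> &items, uint64_t addr)
{
    auto it = std::lower_bound(items.begin(), items.end(), addr,
        [](const Item &i, uint64_t a) { return i.addr < a; });
    return static_cast<size_t>(it - items.begin());
}

// Naive probe: read forward from addr, discarding metadata items one by one
// until a data item (or the end of the list) is reached.
ProbeResult probe_naive(const std::vector<Item> &items, uint64_t addr)
{
    size_t i = first_at_or_after(items, addr);
    size_t reads = 0;
    while (i < items.size()) {
        ++reads;
        if (!items[i].is_metadata) break;
        ++i;
    }
    return ProbeResult{i, reads};
}

// Block-group-aware probe: if addr falls inside a pure metadata block group,
// jump to the end of that group before reading any items.
// Assumes groups are sorted by start address and do not overlap.
ProbeResult probe_skipping(const std::vector<Item> &items,
                           const std::vector<BlockGroup> &groups,
                           uint64_t addr)
{
    assert(std::is_sorted(groups.begin(), groups.end(),
        [](const BlockGroup &a, const BlockGroup &b) { return a.start < b.start; }));
    for (const auto &g : groups) {
        if (g.is_metadata && addr >= g.start && addr < g.end) addr = g.end;
    }
    return probe_naive(items, addr);
}
```

In this model, a probe that lands at the start of a metadata block group
holding 100 items inspects 101 items with `probe_naive` and 1 item with
`probe_skipping`, and the naive cost is paid again at every level of a
binary search.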

Extent scan also had to read every extent item twice to do the
transid filtering, because BeesCrawl's interface discarded the relevant
information when it converted a `BtrfsTreeItem` into a `BeesFileRange`.
The cost of this extra fetch was negligible, but it could have been zero.
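
A minimal sketch of filtering on the generation while the item is still in
hand, so that no second fetch is needed.  `ExtentItem`, `ExtentRange`,
`in_transid_range`, and `select_ranges` are hypothetical stand-ins for
illustration, and the inclusive bounds are an assumption, not a statement
about the actual bees filter.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an extent tree item: it still carries the generation.
struct ExtentItem { uint64_t bytenr; uint64_t length; uint64_t generation; };
// Hypothetical stand-in for the range handed to the scanner: no generation field.
struct ExtentRange { uint64_t begin; uint64_t end; };

// True if the item's generation lies in [min_transid, max_transid].
// Inclusive bounds are assumed here for illustration.
bool in_transid_range(const ExtentItem &item, uint64_t min_transid, uint64_t max_transid)
{
    return item.generation >= min_transid && item.generation <= max_transid;
}

// Apply the transid filter before converting to a range, because the
// conversion discards the generation.
std::vector<ExtentRange> select_ranges(const std::vector<ExtentItem> &items,
                                       uint64_t min_transid, uint64_t max_transid)
{
    assert(min_transid <= max_transid);
    std::vector<ExtentRange> out;
    for (const auto &item : items) {
        if (!in_transid_range(item, min_transid, max_transid)) continue;
        out.push_back(ExtentRange{item.bytenr, item.bytenr + item.length});
    }
    return out;
}
```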

Fix these problems as follows:

* Copy the equivalent of `fetch_extents` from BeesCrawl into
`BeesScanModeExtent`, then give each of the extent scan crawlers its
own `BtrfsDataExtentTreeFetcher` instance.  This enables extent tree
searches to avoid pure (non-mixed) metadata block groups.  `BeesCrawl`
is now used only for its interface to `BeesRoots` for saving state in
`beescrawl.dat`, and never to determine the next extent tree item.

* Move subvol-specific parts of `BeesRoots` into a new class
`BeesScanModeSubvol` so that `BeesScanModeExtent` doesn't have to enable
or support them.  In particular, `bees -m4` no longer enumerates all
of the _subvol_ crawlers.  `BeesRoots` is still used to save and load
crawl state.

* Move several members from `BeesScanModeExtent` into a per-crawler
state object `SizeTier` to eliminate the need for some locks and to
maintain separate cache state for `BtrfsDataExtentTreeFetcher`.

* Reuse the `BtrfsTreeItem` to get the generation field for the transid
range filter.

* Avoid a few corner cases when handling errors, where extent scan might
drop an extent without scanning it, or fail to advance to the next extent.

* Enable the extent-skipping algorithm for large size tiers, now that
`BeesCrawl::fetch_extents` is no longer slowing it down.

* Add a debug stream interface which developers can easily turn on when
needed to inspect the decisions that extent scan is making.

* Track metrics that are more useful, particularly searches per extent
scanned, and fraction of extents that are skipped.
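
The per-crawler state and the two metrics named above could be combined
roughly as follows.  This is a sketch under assumptions: `TierState` and
all of its members are hypothetical names and do not reflect the layout of
the real `SizeTier` object.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-crawler state.  Each crawler owns one instance, so the
// counters need no lock as long as only that crawler updates them.
struct TierState {
    uint64_t min_size;      // smallest extent size this tier scans
    uint64_t max_size;      // largest extent size this tier scans
    uint64_t searches = 0;  // tree searches issued by this crawler
    uint64_t scanned = 0;   // extents inside the tier's size range
    uint64_t skipped = 0;   // extents outside the tier's size range

    TierState(uint64_t lo, uint64_t hi) : min_size(lo), max_size(hi)
    {
        assert(lo <= hi);
    }

    bool accepts(uint64_t extent_size) const
    {
        return extent_size >= min_size && extent_size <= max_size;
    }

    // Record one extent that took n_searches tree searches to reach.
    void record(uint64_t extent_size, uint64_t n_searches)
    {
        searches += n_searches;
        if (accepts(extent_size)) ++scanned; else ++skipped;
    }

    // Searches per extent scanned; 0 if nothing has been scanned yet.
    double searches_per_scanned() const
    {
        return scanned ? double(searches) / double(scanned) : 0.0;
    }

    // Fraction of all extents seen that were skipped; 0 if none were seen.
    double skipped_fraction() const
    {
        const uint64_t total = scanned + skipped;
        return total ? double(skipped) / double(total) : 0.0;
    }
};
```

Keeping the counters inside the per-crawler object means each metric
describes one size tier, which makes it easier to see which tier is
spending searches on extents it then skips.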

Signed-off-by: Zygo Blaxell <bees@furryterror.org>

2025-02-06 22:43:22 -05:00

bin

bees: remove local cruft, throw at github

2016-11-17 12:12:13 -05:00

docs

readahead: clean up the code, update docs

2025-02-06 22:42:15 -05:00

include/crucible

btrfs-tree: add a method to get root backref items to BtrfsRootFetcher

2025-02-06 22:42:15 -05:00

lib

btrfs-tree: harden rlower_bound against exceptional objects

2025-02-06 22:42:15 -05:00

scripts

scripts/beesd: harden the mount options

2025-01-20 01:00:41 -05:00

src

extent scan: refactor BeesCrawl, BeesScanMode*

2025-02-06 22:43:22 -05:00

test

table: add a simple text table renderer

2024-11-30 23:30:33 -05:00

.gitignore

gitignore: clang creates a lot of *.tmp files

2021-11-29 21:27:48 -05:00

COPYING

GPL-3: license it

2016-11-17 12:12:15 -05:00

Defines.mk

beesd: Honor DESTDIR on installation.

2022-12-23 11:10:17 +08:00

Makefile

Makefile: also drop fiemap and fiewalk from main Makefile

2023-01-28 11:21:51 +01:00

makeflags

lib: deprecate memset_zero template, use C99 compound literals instead

2021-11-29 21:27:48 -05:00

README.md

docs: update README.md

2025-01-11 23:39:55 -05:00

README.md

BEES

Best-Effort Extent-Same, a btrfs deduplication agent.

About bees

bees is a block-oriented userspace deduplication agent designed to scale up to large btrfs filesystems. It is an offline dedupe combined with an incremental data scan capability to minimize the time data spends on disk between write and dedupe.

Strengths

* Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1 GB/TB)
* Daemon mode - incrementally dedupes new data as it appears
* Largest extents first - recover more free space during fixed maintenance windows
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
* Persistent hash table for rapid restart after shutdown
* Constant hash table size - no increased RAM usage if data set becomes larger
* Works on live data - no scheduled downtime required
* Automatic self-throttling - reduces system load
* btrfs support - recovers more free space from btrfs than naive dedupers
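
As a rough illustration of the hash table ratio quoted above, the following sketch computes the smallest table size that the 0.1 GB/TB figure suggests. The function name `min_hash_table_bytes` and the decimal reading of the ratio (1 byte of table per 10000 bytes of unique data) are assumptions made for this example; this is a lower bound taken from the ratio, not a sizing recommendation.

```cpp
#include <cassert>
#include <cstdint>

// Smallest hash table size implied by a ratio of one table byte per
// `data_bytes_per_table_byte` bytes of unique data.  The default of 10000
// is an assumed decimal reading of "0.1 GB/TB".
uint64_t min_hash_table_bytes(uint64_t unique_data_bytes,
                              uint64_t data_bytes_per_table_byte = 10000)
{
    assert(data_bytes_per_table_byte > 0);
    // Round up so that any nonzero amount of data gets a nonzero table.
    return unique_data_bytes / data_bytes_per_table_byte
         + (unique_data_bytes % data_bytes_per_table_byte != 0 ? 1 : 0);
}
```

For example, `min_hash_table_bytes(10000000000000ULL)` (10 TB of unique data) evaluates to 1000000000 bytes, or 1 GB.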

Weaknesses

* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
* Requires root privilege (CAP_SYS_ADMIN plus the usual filesystem read/modify caps)
* First run may increase metadata space usage if many snapshots exist
* Constant hash table size - no decreased RAM usage if data set becomes smaller
* btrfs only

Installation and Usage

More Information

Bug Reports and Contributions

Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.

You can also use GitHub:

    https://github.com/Zygo/bees

Copyright & License

GPL (version 3 or later).
