When we have multiple possible matches for a block, we proceed in three phases:

1. retrieve each match's extent refs and put them in a list,
2. iterate over the list, converting viable block matches into range matches,
3. sort and flatten the list of range matches into a non-overlapping list of ranges that cover all duplicate blocks exactly once.

The separation of phases 1 and 2 creates a performance issue when there are many block matches in phase 1 and all the range matches in phase 2 are the same length. Even though we might quickly find the longest possible matching range early in phase 2, phase 1 first extracts the extent refs of every possible matching block, even though most of those refs will never be used.

Fix this by moving the extent ref retrieval from phase 1 into a single loop in phase 2, and stop looping over matching blocks as soon as any dedupe range is created. This avoids iterating over a large list of blocks with expensive `LOGICAL_INO` ioctls in an attempt to improve the match when there is no hope of improvement, e.g. when all match ranges are 4K and the content is extremely prevalent in the data.

If we find a matched block that is part of a short matching range, we can replace it with a block that is part of a long matching range, because there is a good chance we will find a matching hash block in the long range by looking up hashes after the end of the short range. In that case, overlapping dedupe ranges covering both blocks in the target extent will be inserted into the dedupe list, and the longest matches will be selected in phase 3. This usually provides a result similar to that of the old loop in phase 1, but _much_ more efficiently.

Some operations are left in phase 1, but they all use internal functions, not ioctls.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
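To make the restructure concrete, here is a minimal C++ sketch of the fused loop described above. All types and helpers (`BlockMatch`, `ExtentRef`, `RangeMatch`, `resolve_refs`, `grow_match`, `find_dedupe_ranges`) are hypothetical stand-ins, not bees' actual internals; only the role of the `LOGICAL_INO` ioctl is taken from the commit message.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for bees' internal types (not the real ones).
struct BlockMatch { uint64_t hash; uint64_t bytenr; };
struct ExtentRef  { uint64_t bytenr; uint64_t offset; };
struct RangeMatch { uint64_t start; uint64_t length; };

// Stub standing in for the expensive LOGICAL_INO ioctl that resolves a
// block's extent references. In the old code this ran for *every*
// candidate block up front (phase 1).
std::vector<ExtentRef> resolve_refs(const BlockMatch &bm)
{
	return { ExtentRef{bm.bytenr, 0} };  // placeholder result
}

// Stub standing in for extending a block match into a range match by
// comparing adjacent blocks in both files.
RangeMatch grow_match(const BlockMatch &, const ExtentRef &er)
{
	return RangeMatch{er.offset, 4096};  // placeholder: one 4K block
}

std::vector<RangeMatch>
find_dedupe_ranges(const std::vector<BlockMatch> &matches)
{
	std::vector<RangeMatch> dedupe_list;
	// Phases 1 and 2 fused: resolve refs one candidate at a time
	// instead of resolving all of them up front.
	for (const auto &bm : matches) {
		for (const auto &er : resolve_refs(bm)) {
			const RangeMatch rm = grow_match(bm, er);
			if (rm.length > 0) {
				dedupe_list.push_back(rm);
			}
		}
		// Stop as soon as any dedupe range exists: no further
		// LOGICAL_INO calls are spent trying to improve a match
		// that cannot get better.
		if (!dedupe_list.empty()) {
			break;
		}
	}
	// Phase 3 (not shown): sort and flatten dedupe_list into
	// non-overlapping ranges covering each duplicate block once.
	return dedupe_list;
}

int main()
{
	const std::vector<BlockMatch> matches = { {0xabcd, 4096}, {0xabcd, 8192} };
	return static_cast<int>(find_dedupe_ranges(matches).size());
}
```

The early `break` is the heart of the change: once any dedupe range exists, further ioctls over the remaining candidates cannot improve the outcome when all match ranges are the same length, which is exactly the pathological case the commit describes.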
BEES
Best-Effort Extent-Same, a btrfs deduplication agent.
About bees
bees is a block-oriented userspace deduplication agent designed for large btrfs filesystems. It combines offline dedupe with an incremental data scan capability to minimize the time data spends on disk between being written and being deduplicated.
Strengths
- Space-efficient hash table and matching algorithms - can use as little as 1 GB of hash table per 10 TB of unique data (0.1 GB/TB; see the sizing sketch after this list)
- Daemon incrementally dedupes new data using btrfs tree search
- Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- Works around btrfs filesystem structure to free more disk space
- Persistent hash table for rapid restart after shutdown
- Whole-filesystem dedupe - including snapshots
- Constant hash table size - no increased RAM usage if data set becomes larger
- Works on live data - no scheduled downtime required
- Automatic self-throttling based on system load
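As a worked example of the 0.1 GB/TB figure above, here is a small sketch that turns an amount of unique data into a suggested hash table size. The ratio is the only number taken from this README; the function name and the 16 MiB rounding granularity are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstdio>

// Suggested hash table size from the 0.1 GB-per-TB guideline, i.e.
// roughly 1 GB of hash table per 10 TB of unique data. Rounding up to
// a 16 MiB multiple is an assumption made for this illustration.
uint64_t suggested_hash_table_bytes(uint64_t unique_data_bytes)
{
	const uint64_t chunk = 16 * 1024 * 1024;         // 16 MiB granularity
	const uint64_t raw = unique_data_bytes / 10000;  // 0.1 GB per TB = 1/10000
	return ((raw + chunk - 1) / chunk) * chunk;      // round up to a chunk
}

int main()
{
	const uint64_t ten_tb = 10ULL * 1000 * 1000 * 1000 * 1000;
	std::printf("10 TB unique data -> %llu MiB hash table\n",
	            static_cast<unsigned long long>(
	                suggested_hash_table_bytes(ten_tb) >> 20));
	return 0;
}
```

For 10 TB of unique data this prints a table size of 960 MiB (1 GB rounded to the nearest 16 MiB multiple), consistent with the ratio quoted in the list above.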
Weaknesses
- Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- Requires root privilege (or `CAP_SYS_ADMIN`)
- First run may require temporary disk space for extent reorganization
- First run may increase metadata space usage if many snapshots exist
- Constant hash table size - no decreased RAM usage if data set becomes smaller
- btrfs only
Installation and Usage
Recommended Reading
- bees Gotchas
- btrfs kernel bugs - especially DATA CORRUPTION WARNING
- bees vs. other btrfs features
- What to do when something goes wrong
More Information
Bug Reports and Contributions
Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.
You can also use GitHub:
https://github.com/Zygo/bees
Copyright & License
Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).