BEES
====

Best-Effort Extent-Same, a btrfs deduplication daemon.

About Bees
----------

Bees is a daemon designed to run continuously on live file servers.
Bees scans and deduplicates whole filesystems in a single pass instead
of separate scan and dedup phases. RAM usage does _not_ depend on
unique data size or the number of input files. Hash tables and scan
progress are stored persistently so the daemon can resume after a reboot.
Bees uses the Linux kernel's `dedupe_file_range` feature to ensure data
is handled safely even if other applications concurrently modify it.

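As a concrete illustration of that interface (a minimal sketch, not Bees's
actual code), here is how a single range can be deduplicated using the
generic `FIDEDUPERANGE` form of the ioctl from `<linux/fs.h>` on kernels
4.5 and later; the btrfs-specific `BTRFS_IOC_FILE_EXTENT_SAME` ioctl takes
arguments with the same layout:

    // Sketch only -- not Bees's actual code.  Ask the kernel to dedup
    // 'len' bytes at src_off in src_fd against dst_off in dst_fd.  The
    // kernel locks both ranges and compares them byte for byte before
    // sharing extents, which is why concurrent writers are safe.
    #include <linux/fs.h>       // FIDEDUPERANGE, struct file_dedupe_range
    #include <sys/ioctl.h>
    #include <cstdlib>

    bool dedupe_range(int src_fd, __u64 src_off, int dst_fd, __u64 dst_off, __u64 len)
    {
        const size_t sz = sizeof(file_dedupe_range) + sizeof(file_dedupe_range_info);
        auto *args = static_cast<file_dedupe_range *>(calloc(1, sz));
        args->src_offset = src_off;
        args->src_length = len;
        args->dest_count = 1;
        args->info[0].dest_fd = dst_fd;
        args->info[0].dest_offset = dst_off;
        const int rc = ioctl(src_fd, FIDEDUPERANGE, args);
        // status is FILE_DEDUPE_RANGE_SAME (0) on success, or
        // FILE_DEDUPE_RANGE_DIFFERS (1) if the data changed under us.
        const bool ok = (rc == 0 && args->info[0].status == FILE_DEDUPE_RANGE_SAME);
        free(args);
        return ok;
    }
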
Bees is intentionally btrfs-specific for performance and capability.
Bees uses the btrfs `SEARCH_V2` ioctl to scan for new data without the
overhead of repeatedly walking filesystem trees with the POSIX API.
Bees uses `LOGICAL_INO` and `INO_PATHS` to leverage btrfs's existing
metadata instead of building its own redundant data structures.
Bees can cope with Btrfs filesystem compression. Bees can reassemble
Btrfs extents to deduplicate extents that contain a mix of duplicate
and unique data blocks.

Bees includes a number of workarounds for Btrfs kernel bugs to (try to)
avoid ruining your day. You're welcome.

How Bees Works
--------------

Bees uses a fixed-size persistent dedup hash table with a variable dedup
block size. Any size of hash table can be dedicated to dedup. Bees will
scale the dedup block size to fit the filesystem's unique data size
using a weighted sampling algorithm. This allows Bees to adapt itself
to its filesystem size without forcing admins to do math at install time.
At the same time, the duplicate block alignment constraint can be as low
as 4K, allowing efficient deduplication of files with narrowly-aligned
duplicate block offsets (e.g. compiled binaries and VM/disk images)
even if the effective block size is much larger.

The Bees hash table is loaded into RAM at startup (using hugepages if
available), mlocked, and synced to persistent storage by trickle-writing
over a period of several hours. This avoids issues related to seeking
or fragmentation, and enables the hash table to be efficiently stored
on Btrfs with compression (or an ext4 filesystem, or a raw disk, or
on CIFS...).

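As a rough sketch of the general technique (illustrative only--Bees's
actual writeback mechanism may differ in detail), the table can be
mapped, pinned in RAM, and flushed back one slice at a time:

    // Sketch only: map the hash table file, pin it in RAM, and flush
    // it back to disk in small slices instead of one large burst.
    #include <sys/mman.h>
    #include <cstddef>

    char *map_and_lock(int fd, size_t size)
    {
        void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return nullptr;
        mlock(p, size);                  // keep the whole table resident
        return static_cast<char *>(p);
    }

    // Called periodically from a background thread.  Each call syncs one
    // slice; with a 1GB table, 4MB slices, and one call per minute, a
    // full writeback cycle takes a little over four hours.
    void trickle_sync(char *table, size_t size, size_t slice, size_t &cursor)
    {
        msync(table + cursor, slice, MS_ASYNC);
        cursor = (cursor + slice) % size;
    }
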
Once a duplicate block is identified, Bees examines the nearby blocks
in the files where the block appears. This allows Bees to find long runs
of adjacent duplicate block pairs if it has an entry for any one of
the blocks in its hash table. The stored hash entry plus the block
recently scanned from disk form a duplicate pair. On typical data sets,
this means most of the blocks in the hash table are redundant and can
be discarded without significant performance impact.

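Here is a simplified sketch of that search (names and signature are
illustrative, not Bees's API): starting from the single block pair that
matched in the hash table, walk outward in both directions until the
blocks stop matching:

    // Sketch only.  Given one matching 4K block pair at offsets a and b,
    // grow the match in both directions over adjacent duplicate blocks.
    #include <cstdint>
    #include <functional>

    constexpr uint64_t BLOCK = 4096;

    // blocks_equal(x, y) is a stand-in for reading and comparing one
    // block from each file; a and b are updated to the run's start.
    uint64_t extend_match(uint64_t &a, uint64_t &b,
                          uint64_t a_size, uint64_t b_size,
                          const std::function<bool(uint64_t, uint64_t)> &blocks_equal)
    {
        uint64_t len = BLOCK;
        while (a >= BLOCK && b >= BLOCK && blocks_equal(a - BLOCK, b - BLOCK)) {
            a -= BLOCK;                   // grow backward
            b -= BLOCK;
            len += BLOCK;
        }
        while (a + len + BLOCK <= a_size && b + len + BLOCK <= b_size &&
               blocks_equal(a + len, b + len)) {
            len += BLOCK;                 // grow forward
        }
        return len;  // one hash table hit can cover a long duplicate run
    }
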
Hash table entries are grouped together into LRU lists. As each block
is scanned, its hash table entry is inserted into the LRU list at a
random position. If the LRU list is full, the entry at the end of the
list is deleted. If a hash table entry is used to discover duplicate
blocks, the entry is moved to the beginning of the list. This makes Bees
unable to detect a small number of duplicates (less than 1% on typical
filesystems), but it dramatically improves efficiency on filesystems
with many small files. Bees has found a net 13% more duplicate bytes
than a naive fixed-block-size algorithm using a 64K block size and the
same size of hash table, even after discarding 1% of the duplicate bytes.

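A simplified sketch of that policy (illustrative only--the real hash
table is far more compact than a linked list):

    // Sketch only: random-insertion LRU as described above.
    #include <cstdint>
    #include <list>
    #include <random>

    struct Entry { uint64_t hash; uint64_t addr; };

    class LruBucket {
        std::list<Entry> entries;     // front = most recently useful
        size_t capacity;
        std::mt19937_64 rng{std::random_device{}()};
    public:
        explicit LruBucket(size_t cap) : capacity(cap) {}

        // New entries enter at a random position; entries inserted near
        // the tail are evicted soon unless they prove useful.
        void insert(const Entry &e) {
            std::uniform_int_distribution<size_t> pos(0, entries.size());
            auto it = entries.begin();
            std::advance(it, pos(rng));
            entries.insert(it, e);
            if (entries.size() > capacity)
                entries.pop_back();   // drop the least recently useful entry
        }

        // An entry that discovered a duplicate moves to the front,
        // protecting it from eviction.
        void promote(std::list<Entry>::iterator it) {
            entries.splice(entries.begin(), entries, it);
        }
    };
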
Hash Table Sizing
-----------------

Hash table entries are 16 bytes each (64-bit hash, 52-bit block number,
and some metadata bits). Each entry represents a minimum of 4K on disk.

| unique data size | hash table size | average dedup block size |
|------------------|-----------------|--------------------------|
| 1TB              | 4GB             | 4K                       |
| 1TB              | 1GB             | 16K                      |
| 1TB              | 256MB           | 64K                      |
| 1TB              | 16MB            | 1024K                    |
| 64TB             | 1GB             | 1024K                    |

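The arithmetic behind this table: the average dedup block size is the
unique data size divided by the number of entries, i.e. by table size / 16.
A sketch of the entry layout and the math (illustrative only; the real
on-disk format is defined by the Bees source):

    #include <cstdint>

    // Illustrative 16-byte entry layout, matching the sizes quoted
    // above; not a statement of Bees's exact on-disk format.
    struct HashEntry {
        uint64_t hash;        // 64-bit block hash
        uint64_t addr : 52;   // 52-bit block number
        uint64_t meta : 12;   // metadata bits (e.g. compressed flag)
    };
    static_assert(sizeof(HashEntry) == 16, "entries must stay 16 bytes");

    // average dedup block size = unique data size / (table size / 16).
    // e.g. a 1GB table holds 64Mi entries; spread over 1TB of unique
    // data, each entry covers 16K on average, matching the table above.
    constexpr uint64_t avg_block_size(uint64_t unique_bytes, uint64_t table_bytes)
    {
        return unique_bytes / (table_bytes / sizeof(HashEntry));
    }
    static_assert(avg_block_size(1ULL << 40, 1ULL << 30) == 16384,
                  "1TB of unique data with a 1GB table = 16K blocks");
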
It is possible to resize the hash table by changing the size of
`beeshash.dat` (e.g. with `truncate`) and restarting `bees`. This
does not preserve all the existing hash table entries, but it does
preserve more than zero of them--especially if one of the old and new
sizes is a power-of-two multiple of the other.

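To see why power-of-two resizes are friendlier, here is a sketch under
the assumption that an entry's bucket is derived from its hash modulo
the bucket count (an illustrative assumption, not a statement of Bees's
exact layout):

    #include <cstdint>

    // Assumption for illustration: bucket = hash % bucket count.  When
    // the bucket count doubles, each entry lands either at its old index
    // or at one predictable new index, so many entries remain findable.
    constexpr uint64_t bucket_of(uint64_t hash, uint64_t n_buckets)
    {
        return hash % n_buckets;
    }
    static_assert(bucket_of(0x12345, 2048) == bucket_of(0x12345, 1024) ||
                  bucket_of(0x12345, 2048) == bucket_of(0x12345, 1024) + 1024,
                  "power-of-two growth moves entries to one of two known slots");
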
Things You Might Expect That Bees Doesn't Have
----------------------------------------------

* There's no configuration file or getopt command-line option processing
  (patches welcome!). There are some tunables hardcoded in the source
  that could eventually become configuration options.

* There's no clean way to *stop* the Bees daemon. Use SIGKILL, SIGTERM, or
  Ctrl-C for now. Some of the destructors are unreachable and have never
  been tested. Bees will repeat some work when restarted.

* The Bees process doesn't fork, and it writes its log to stdout/stderr.
  A shell wrapper is required to make it behave more like a daemon.

* There's no facility to exclude any part of a filesystem (patches
  welcome).

* PREALLOC extents and extents containing blocks filled with zeros will
  be replaced by holes unconditionally.

* Duplicate block groups that are less than 12K in length can take 30%
  of the run time while saving only 3% of the disk space. There should
  be an option to just not bother with those.

* There is a lot of duplicate reading of blocks in snapshots. Bees will
  scan all snapshots at close to the same time to try to get better
  performance from caching, but really fixing this requires rewriting the
  crawler to scan the btrfs extent tree directly instead of the subvol
  FS trees.

* Bees had support for multiple worker threads in the past; however,
  this was removed because it made Bees too aggressive to coexist with
  other applications on the same machine. It also hit the *slow backrefs*
  kernel bug on N CPU cores instead of just one.

* Block reads are currently more allocation- and CPU-intensive than they
  should be, especially for filesystems on SSD where the IO overhead is
  much smaller. This is a problem for power-constrained environments
  (e.g. laptops with slow CPUs).

* Bees can currently fragment extents when required to remove duplicate
  blocks, but it has no defragmentation capability yet. When possible, Bees
  will attempt to work with existing extent boundaries, but it will not
  aggregate blocks together from multiple extents to create larger ones.

Good Btrfs Feature Interactions
-------------------------------

Bees has been tested in combination with the following:

* btrfs compression (either method), mixtures of compressed and uncompressed extents
* PREALLOC extents (unconditionally replaced with holes)
* HOLE extents and the btrfs no-holes feature
* Other deduplicators and reflink copies (though Bees may decide to redo their work)
* btrfs snapshots and non-snapshot subvols (RW only)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, build daemons)
* All btrfs RAID profiles (people ask about this, but dedup operates above the RAID layer, so it's irrelevant)
* IO errors during dedup (read errors throw exceptions; Bees catches them and skips over the affected extent)
* Filesystems mounted *with* the flushoncommit option
* 4K filesystem data block size / clone alignment
* 64-bit CPUs (amd64)
* Large (>16M) extents
* Huge files (>1TB--although Btrfs performance on such files isn't great in general)
* Filesystems up to 25TB and 100M+ files

Bad Btrfs Feature Interactions
------------------------------

Bees has not been tested with the following, and undesirable interactions may occur:

* Non-4K filesystem data block size (should work if recompiled)
* 32-bit CPUs (x86, arm)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (probably never will work)
* btrfs read-only snapshots (never tested, probably wouldn't work well)
* btrfs send/receive (receive is probably OK, but send requires RO snapshots; see above)
* btrfs qgroups (never tested, no idea what might happen)
* btrfs seed filesystems (does anyone even use those?)
* btrfs autodefrag mount option (never tested, could fight with Bees)
* btrfs nodatacow mount option or inode attribute (*could* work, but might not)
* btrfs out-of-tree kernel patches (e.g. in-band dedup or encryption)
* btrfs-convert from ext2/3/4 (never tested)
* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
* open(O_DIRECT)
* Filesystems mounted *without* the flushoncommit option

Other Caveats
-------------

* btrfs balance will invalidate parts of the dedup table. Bees will
  happily rebuild the table, but it will have to scan all the blocks
  again.

* btrfs defrag will cause Bees to rescan the defragmented file. If it
  contained duplicate blocks and other references to the original
  fragmented duplicates still exist, Bees will replace the defragmented
  extents with the original fragmented ones.

* Bees creates temporary files (with O_TMPFILE) and uses them to split
  and combine extents elsewhere in btrfs. These will take up to 2GB
  during normal operation.

* Like all deduplicators, Bees will replace data blocks with metadata
  references. It is a good idea to ensure there are several GB of
  unallocated space (see `btrfs fi df`) on the filesystem before running
  Bees for the first time. Use

        btrfs balance start -dusage=100,limit=1 /your/filesystem

  If possible, raise the `limit` parameter to the current size of metadata
  usage (from `btrfs fi df`) plus 1.

A Brief List Of Btrfs Kernel Bugs
---------------------------------

Fixed bugs:

* 3.13: `FILE_EXTENT_SAME` ioctl added. No way to reliably dedup with
  concurrent modifications before this.
* 3.16: `SEARCH_V2` ioctl added. Bees could use `SEARCH` instead.
* 4.2: `FILE_EXTENT_SAME` no longer updates mtime, and can be used at EOF.
  Kernel deadlock bugs fixed.
* 4.7: the *slow backref* bug no longer triggers a softlockup panic. It
  still takes too long to resolve a block address to a root/inode/offset
  triple.

Unfixed kernel bugs (as of 4.5.7) with workarounds in Bees:

* *slow backref*: If the number of references to a single shared extent
  within a single file grows above a few thousand, the kernel consumes CPU
  for up to 40 uninterruptible minutes while holding various locks that
  block access to the filesystem. Bees avoids this bug by measuring the
  time the kernel spends performing certain operations and permanently
  blacklisting any extent or hash where the kernel starts to get slow.
  Inside Bees, such blocks are marked as 'toxic' hash/block addresses
  (see the sketch after this list).

* `LOGICAL_INO` output is arbitrarily limited to 2730 references,
  even if more buffer space is provided for results (2730 references
  of 24 bytes each plus a 16-byte header exactly fill a 64KiB buffer).
  Once this number has been reached, Bees can no longer replace the
  extent, since it can't find and remove all existing references. Bees
  refrains from adding any more references after the first 2560.
  Offending blocks are marked 'toxic' even if there is no corresponding
  performance problem. This places an obvious limit on dedup efficiency
  for extremely common blocks or filesystems with many snapshots
  (although this limit is far greater than the effective limit imposed
  by the *slow backref* bug).

* `FILE_EXTENT_SAME` is arbitrarily limited to 16MB. This is less than
  the 128MB maximum extent size that can be created by defrag or
  prealloc. Bees avoids the feedback loops this mismatch can generate
  when attempting to replace extents over 16MB in length.

* `DEFRAG_RANGE` is useless. The ioctl attempts to implement `btrfs
  fi defrag` in the kernel, and will arbitrarily defragment more or
  less than the range requested to match the behavior expected from the
  userspace tool. Bees implements its own defrag instead, copying data
  to a temporary file and using the `FILE_EXTENT_SAME` ioctl to replace
  precisely the specified range of offending fragmented blocks.

* When writing BeesStringFile, a crash can cause the directory entry
  `beescrawl.UUID.dat.tmp` to exist without a corresponding inode.
  This directory entry cannot be renamed or removed; however, it does
  not prevent the creation of a second directory entry with the same
  name that functions normally, so it doesn't prevent Bees operation.

  The orphan directory entry can be removed by deleting its subvol,
  so place BEESHOME on a separate subvol so you can delete these orphan
  directory entries when they occur (or use btrfs zero-log before mounting
  the filesystem after a crash).

* If the fsync() in BeesTempFile::make_copy is removed, the filesystem
  hangs within a few hours, requiring a reboot to recover.

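Below is a minimal sketch of the timing check described in the *slow
backref* item above; the threshold is a hypothetical tunable, not a
value from the Bees source:

    // Sketch only: wrap a kernel operation, time it, and report whether
    // the extent involved should be blacklisted as 'toxic'.
    #include <chrono>
    #include <utility>

    template <typename F>
    bool is_toxic_call(F &&kernel_op, double threshold_sec /* hypothetical tunable */)
    {
        const auto t0 = std::chrono::steady_clock::now();
        std::forward<F>(kernel_op)();
        const std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - t0;
        // If the kernel got slow here once, it will get slower as more
        // references are added, so the caller permanently blacklists
        // the hash/block address instead of retrying.
        return dt.count() > threshold_sec;
    }
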
Not really a bug, but a gotcha nonetheless:

* If a process holds a directory FD open, the subvol containing the
  directory cannot be deleted (`btrfs sub del` will start the deletion
  process, but it will not proceed past the first open directory FD).
  `btrfs-cleaner` will simply skip over the directory *and all of its
  children* until the FD is closed. Bees avoids this gotcha by closing
  all of the FDs in its directory FD cache every 15 minutes.

Requirements
------------

* C++11 compiler (tested with GCC 4.9)

  Sorry. I really like closures.

* btrfs-progs (tested with 4.1..4.7)

  Needed for btrfs.h and ctree.h during compile.
  Not needed at runtime.

* libuuid-dev

  TODO: remove the one function used from this library.
  It supports a feature Bees no longer implements.

* Linux kernel 4.2 or later

  Don't bother trying to make Bees work with older kernels.
  It won't end well.

* 64-bit host and target CPU

  This code has never been tested on a 32-bit target CPU.

  A 64-bit host CPU may be required for the self-tests.
  Some of the ioctls don't work properly with a 64-bit
  kernel and 32-bit userspace.

Build
-----

Build with `make`.

The build produces `bin/bees` and `lib/libcrucible.so`, which must be
copied to somewhere in `$PATH` and `$LD_LIBRARY_PATH` on the target
system, respectively.

Setup
-----

Create a directory for Bees state files:

    export BEESHOME=/some/path
    mkdir -p "$BEESHOME"

Create an empty hash table (your choice of size, but it must be a multiple
of 16M). This example creates a 1GB hash table:

    truncate -s 1g "$BEESHOME/beeshash.dat"
    chmod 700 "$BEESHOME/beeshash.dat"

Configuration
-------------

The only runtime-configurable options are environment variables:

* `BEESHOME`: Directory containing Bees state files:
  * `beeshash.dat`: persistent hash table (must be a multiple of 16M)
  * `beescrawl.UUID.dat`: state of the `SEARCH_V2` crawlers
  * `beesstats.txt`: statistics and performance counters
* `BEESSTATUS`: File containing a snapshot of current Bees state (performance
  counters and the current status of each thread).

Other options (e.g. the interval between filesystem crawls) can be configured
in `src/bees.h`.

Running
-------

We created this directory in the previous section:

    export BEESHOME=/some/path

Use a tmpfs for `BEESSTATUS`; it updates once per second:

    export BEESSTATUS=/run/bees.status

Bees can only process the root subvol of a btrfs filesystem (seriously--if
the argument is not the root subvol directory, Bees will just throw an
exception and stop).

Use a bind mount, and let only Bees access it:

    mount -osubvol=/ /dev/<your-filesystem> /var/lib/bees/root

Reduce CPU and IO priority to be kinder to other applications
sharing this host (or raise them for more aggressive disk space
recovery). If you use cgroups, put `bees` in its own cgroup, then reduce
the `blkio.weight` and `cpu.shares` parameters. You can also use
`schedtool` and `ionice` in the shell script that launches `bees`:

    schedtool -D -n20 $$
    ionice -c3 -p $$

Let the bees fly:

    bees /var/lib/bees/root >> /var/log/bees.log 2>&1

You'll probably want to arrange for `/var/log/bees.log` to be rotated
periodically. You may also want to set umask to 077 to prevent disclosure
of information about the contents of the filesystem through the log file.

Bug Reports and Contributions
-----------------------------

Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.

You can also use GitHub:

    https://github.com/Zygo/bees

Copyright & License
===================

Copyright 2015-2016 Zygo Blaxell <bees@furryterror.org>.

GPL (version 3 or later).