1
0
mirror of https://github.com/Zygo/bees.git synced 2025-05-17 13:25:45 +02:00

README: more docs

This commit is contained in:
Zygo Blaxell 2016-11-16 16:04:26 -05:00
parent 4c9982e870
commit 74de78947d

337
README.md
View File

@ -1,33 +1,289 @@
BEES
====
Best-Effort Extent-Same, a btrfs deduplicator.
Best-Effort Extent-Same, a btrfs deduplication daemon.
TODO
----
About Bees
----------
Write some docs here:
Bees is a daemon designed to run continuously on live file servers.
Bees consumes entire filesystems and deduplicates in a single pass, using
minimal RAM to store data. Bees maintains persistent state so it can be
interrupted and resumed, whether by planned upgrades or unplanned crashes.
Bees makes continuous incremental progress instead of using separate
scan and dedup phases. Bees uses the Linux kernel's `dedupe_file_range`
system call to ensure data is handled safely even if other applications
concurrently modify it.
* copyright (Zygo Blaxell 2015-2016), license (GPL3+)
* what it is
* what it isn't
* building it
* what works
* what doesn't work
* a brief history of btrfs kernel bugs
* things that could have been, and why they aren't
* roadmap (and anti-roadmap)
* how to report bugs
* how to contribute
Bees is intentionally btrfs-specific for performance and capability.
Bees uses the btrfs `SEARCH_V2` ioctl to scan for new data
without the overhead of repeatedly walking filesystem trees with the
POSIX API. Bees uses `LOGICAL_INO` and `INO_PATHS` to leverage btrfs's
existing metadata instead of building its own redundant data structures.
Bees can cope with Btrfs filesystem compression. Bees can reassemble
Btrfs extents to deduplicate extents that contain a mix of duplicate
and unique data blocks.
Bees includes a number of workarounds for Btrfs kernel bugs to (try to)
avoid ruining your day. You're welcome.
How Bees Works
--------------
Bees uses a fixed-size persistent dedup hash table with a variable dedup
block size. Any size of hash table can be dedicated to dedup. Bees will
scale the dedup block size to fit the filesystem's unique data size
using a weighted sampling algorithm. This allows Bees to adapt itself
to its filesystem size without forcing admins to do math at install time.
At the same time, the duplicate block alignment constraint can be as low
as 4K, allowing efficient deduplication of files with narrowly-aligned
duplicate block offsets (e.g. compiled binaries and VM/disk images).
The Bees hash table is loaded into RAM at startup (using hugepages if
available), mlocked, and synced to persistent storage by trickle-writing
over a period of several hours. This avoids issues related to seeking
or fragmentation, and enables the hash table to be efficiently stored
on Btrfs with compression (or an ext4 filesystem, or a raw disk, or
on CIFS...).
Once a duplicate block is identified, Bees examines the nearby blocks
in the files where block appears. This allows Bees to find long runs
of adjacent duplicate block pairs if it has an entry for any one of
the blocks in its hash table. The stored hash entry plus the block
recently scanned from disk form a duplicate pair. On typical data sets,
this means most of the blocks in the hash table are redundant and can
be discarded without significant performance impact.
Hash table entries are grouped together into LRU lists. As each block
is scanned, its hash table entry is inserted into the LRU list at a
random position. If the LRU list is full, the entry at the end of the
list is deleted. If a hash table entry is used to discover duplicate
blocks, the entry is moved to the beginning of the list. This makes Bees
unable to detect a small number of duplicates (less than 1% on typical
filesystems), but it dramatically improves efficiency on filesystems
with many small files. Bees has found a net 13% more duplicate bytes
than a naive fixed-block-size algorithm with a 64K block size using the
same size of hash table, even after discarding 1% of the duplicate bytes.
Hash Table Sizing
-----------------
Hash table entries are 16 bytes each (64-bit hash, 52-bit block number,
and some metadata bits). Each entry represents a minimum of 4K on disk.
unique data size hash table size average dedup block size
1TB 4GB 4K
1TB 1GB 16K
1TB 256MB 64K
1TB 16MB 1024K
64TB 1GB 1024K
Things You Might Expect That Bees Doesn't Have
----------------------------------------------
* There's no configuration file or getopt command line option processing
(patches welcome!). There are some tunables hardcoded in the source
that could eventually become configuration options.
* There's no way to *stop* the Bees daemon. Use SIGKILL, SIGTERM, or
Ctrl-C for now. Some of the destructors are unreachable and have never
been tested. Bees will repeat some work when restarted.
* The Bees process doesn't fork and writes its log to stdout/stderr.
A shell wrapper is required to make it behave more like a daemon.
* There's no facility to exclude any part of a filesystem (patches
welcome).
* PREALLOC extents and extents containing blocks filled with zeros will
be replaced by holes unconditionally.
* Duplicate block groups that are less than 12K in length can take 30%
of the run time while saving only 3% of the disk space. There should
be an option to just not bother with those.
* There is a lot of duplicate reading of blocks in snapshots. Bees will
scan all snapshots at close to the same time to try to get better
performance by caching, but really fixing this requires rewriting the
crawler to scan the btrfs extent tree directly instead of the subvol
FS trees.
* Bees had support for multiple worker threads in the past; however,
this was removed because it made Bees too aggressive to coexist with
other applications on the same machine. It also hit the *slow backrefs*
on N CPU cores instead of just one.
Good Btrfs Feature Interactions
-------------------------------
Bees has been tested in combination with the following:
* btrfs compression (either method), mixtures of compressed and uncompressed extents
* PREALLOC extents (unconditionally replaced with holes)
* HOLE extents and btrfs no-holes feature
* Other deduplicators, reflink copies (though Bees may decide to redo their work)
* btrfs snapshots and non-snapshot subvols (RW only)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, build daemons)
* all btrfs RAID profiles (people ask about this, but it's irrelevant)
* IO errors during dedup (read errors will throw exceptions, Bees will catch them and skip over the affected extent)
* Filesystems mounted *with* the flushoncommit option
* 4K filesystem data block size / clone alignment
* 64-bit CPUs (amd64)
* Large (>16M) extents
* Huge files (>1TB--although Btrfs performance on such files isn't great in general)
* filesystems up to 25T bytes, 100M+ files
Bad Btrfs Feature Interactions
------------------------------
Bees has not been tested with the following, and undesirable interactions may occur:
* Non-4K filesystem data block size (should work if recompiled)
* 32-bit CPUs (x86, arm)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (probably never will work)
* btrfs read-only snapshots (never tested, probably wouldn't work well)
* btrfs send/receive (receive is probably OK, but send requires RO snapshots. See above)
* btrfs qgroups (never tested, no idea what might happen)
* btrfs seed filesystems (does anyone even use those?)
* btrfs autodefrag mount option (never tested, could fight with Bees)
* btrfs nodatacow mount option or inode attribute (*could* work, but might not)
* btrfs out-of-tree kernel patches (e.g. in-band dedup or encryption)
* btrfs-convert from ext2/3/4 (never tested)
* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
* open(O_DIRECT)
* Filesystems mounted *without* the flushoncommit option
Other Caveats
-------------
* btrfs balance will invalidate parts of the dedup table. Bees will
happily rebuild the table, but it will have to scan all the blocks
again.
* btrfs defrag will cause Bees to rescan the defragmented file. If it
contained duplicate blocks and other references to the original
fragmented duplicates still exist, Bees will replace the defragmented
extents with the original fragmented ones.
* Bees creates temporary files (with O_TMPFILE) and uses them to split
and combine extents elsewhere in btrfs. These will take up to 2GB
during normal operation.
* Like all deduplicators, Bees will replace data blocks with metadata
references. It is a good idea to ensure there are several GB of
unallocated space (see `btrfs fi df`) on the filesystem before running
Bees for the first time. Use
btrfs balance start -dusage=100,limit=1 /your/filesystem
If possible, raise the `limit` parameter to the current size of metadata
usage (from `btrfs fi df`) plus 1.
A Brief List Of Btrfs Kernel Bugs
---------------------------------
Fixed bugs:
* 3.13: `FILE_EXTENT_SAME` ioctl added. No way to reliably dedup with
concurrent modifications before this.
* 3.16: `SEARCH_V2` ioctl added. Bees could use `SEARCH` instead.
* 4.2: `FILE_EXTENT_SAME` no longer updates mtime, can be used at EOF.
Kernel deadlock bugs fixed.
* 4.7: *slow backref* bug no longer triggers a softlockup panic. It still
too long to resolve a block address to a root/inode/offset triple.
Unfixed kernel bugs (as of 4.5.7) with workarounds in Bees:
* *slow backref*: If the number of references to a single shared extent
within a single file grows above a few thousand, the kernel consumes CPU
for up to 40 uninterruptible minutes while holding various locks that
block access to the filesystem. Bees avoids this bug by measuring the
time the kernel spends performing certain operations and permanently
blacklisting any extent or hash where the kernel starts to get slow.
Inside Bees, such blocks are marked as 'toxic' hash/block addresses.
* `LOGICAL_INO` output is arbitrarily limited to 2730 references
even if more buffer space is provided for results. Once this number
has been reached, Bees can no longer replace the extent since it can't
find and remove all existing references. Bees refrains from adding
any more references after the first 2560. Offending blocks are
marked 'toxic' even if there is no corresponding performance problem.
This places an obvious limit on dedup efficiency for extremely common
blocks or filesystems with many snapshots (although this limit is
far greater than the effective limit imposed by the *slow backref* bug).
* `FILE_EXTENT_SAME` is arbitrarily limited to 16MB. This is less than
128MB which is the maximum extent size that can be created by defrag
or prealloc. Bees avoids feedback loops this can generate while
attempting to replace extents over 16MB in length.
* `DEFRAG_RANGE` is useless. The ioctl attempts to implement `btrfs
fi defrag` in the kernel, and will arbitrarily defragment more or
less than the range requested to match the behavior expected from the
userspace tool. Bees implements its own defrag instead, copying data
to a temporary file and using the `FILE_EXTENT_SAME` ioctl to replace
precisely the specified range of offending fragmented blocks.
* When writing BeesStringFile, a crash can cause the directory entry
`beescrawl.UUID.dat.tmp` to exist without a corresponding inode.
This directory entry cannot be renamed or removed; however, it does
not prevent the creation of a second directory entry with the same
name that functions normally, so it doesn't prevent Bees operation.
The orphan directory entry can be removed by deleting its subvol,
so place BEESHOME on a separate subvol so you can delete these orphan
directory entries when they occur (or use btrfs zero-log before mounting
the filesystem after a crash).
* If the fsync() BeesTempFile::make_copy is removed, the filesystem
hangs within a few hours, requiring a reboot to recover.
Not really a bug, but a gotcha nonetheless:
* If a process holds a directory FD open, the subvol containing the
directory cannot be deleted (`btrfs sub del` will start the deletion
process, but it will not proceed past the first open directory FD).
`btrfs-cleaner` will simply skip over the directory *and all of its
children* until the FD is closed. Bees avoids this gotcha by closing
all of the FDs in its directory FD cache every 15 minutes.
Requirements
------------
* C++11 compiler (tested with GCC 4.9)
Sorry. I really like closures.
* btrfs-progs (tested with 4.1..4.7)
Needed for btrfs.h and ctree.h during compile.
Not needed at runtime.
* libuuid-dev
TODO: remove the one function used from this library.
It supports a feature Bees no longer implements.
* Linux kernel 4.2 or later
Don't bother trying to make Bees work with older kernels.
It won't end well.
* 64-bit host and target CPU
This code has never been tested on a 32-bit target CPU.
A 64-bit host CPU may be required for the self-tests.
Some of the ioctls don't work properly with a 64-bit
kernel and 32-bit userspace.
Build
-----
Requirements:
* C++11 compiler (I use GCC 4.9)
* btrfs-progs (I've used 4.1..4.7) for /usr/include/btrfs/*
* libuuid-dev (TODO: remove the one function we call from this library)
Build with `make`.
The build produces `bin/bees` and `lib/libcrucible.so`, which must be
@ -66,23 +322,54 @@ in src/bees.h.
Running
-------
We created this directory in the previous section.
We created this directory in the previous section:
export BEESHOME=/some/path
Use a tmpfs for BEESSTATUS, it updates once per second
Use a tmpfs for BEESSTATUS, it updates once per second:
export BEESSTATUS=/run/bees.status
bees can only process the root subvol of a btrfs.
Use a bind mount, and let only bees access it.
bees can only process the root subvol of a btrfs (seriously--if the
argument is not the root subvol directory, Bees will just throw an
exception and stop).
Use a bind mount, and let only bees access it:
mount -osubvol=/ /dev/<your-filesystem> /var/lib/bees/root
Let the bees fly!
Reduce CPU and IO priority to be kinder to other applications
sharing this host (or raise them for more aggressive disk space
recovery). If you use cgroups, put bees in its own cgroup, then reduce
the `blkio.weight` and `cpu.shares` parameters. You can also use
`schedtool` and `ionice in the shell script that launches bees:
schedtool -D -n20 $$
ionice -c3 -p $$
Let the bees fly:
bees /var/lib/bees/root >> /var/log/bees.log 2>&1
You'll probably want to arrange for /var/log/bees.log to be rotated
periodically. You may also want to set umask to 077 to prevent disclosure
of information about the contents of the filesystem through the log file.
Bug Reports and Contributions
-----------------------------
Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.
You can also use Github:
https://github.com/Zygo/bees
Copyright & License
===================
Copyright 2015-2016 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).