mirror of
https://github.com/Zygo/bees.git
synced 2025-05-17 21:35:45 +02:00
README: split into sections, reformat for github.io
Split the rather large README into smaller sections with a pitch and a
ToC at the top.  Move the sections into docs/ so that Github Pages can
read them.  'make doc' produces a local HTML tree.

Update the kernel bugs and gotchas list.  Add some information that has
been accumulating in Github comments.  Remove information about bugs in
kernels earlier than 4.14.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit is contained in:
parent 32d2739b0d
commit e8298570ed

14	Makefile
@@ -7,8 +7,6 @@ LIBEXEC_PREFIX ?= $(LIB_PREFIX)/bees
 
 SYSTEMD_SYSTEM_UNIT_DIR ?= $(shell pkg-config systemd --variable=systemdsystemunitdir)
 
-MARKDOWN := $(firstword $(shell type -P markdown markdown2 markdown_py 2>/dev/null || echo markdown))
-
 BEES_VERSION ?= $(shell git describe --always --dirty || echo UNKNOWN)
 
 # allow local configuration to override above variables
@@ -25,13 +23,12 @@ include Defines.mk
 
 default: $(DEFAULT_MAKE_TARGET)
 
 all: lib src scripts
 
-docs: README.html
-
-reallyall: all docs test
+reallyall: all doc test
 
 clean: ## Cleanup
 	git clean -dfx -e localconf
 
-.PHONY: lib src test
+.PHONY: lib src test doc
 
 lib: ## Build libs
 	$(MAKE) -C lib
@@ -44,15 +41,14 @@ test: ## Run tests
 
 test: lib src
 	$(MAKE) -C test
 
+doc: ## Build docs
+	$(MAKE) -C docs
+
 scripts/%: scripts/%.in
 	$(TEMPLATE_COMPILER)
 
 scripts: scripts/beesd scripts/beesd@.service
 
-README.html: README.md
-	$(MARKDOWN) README.md > README.html.new
-	mv -f README.html.new README.html
-
 install_libs: lib
 	install -Dm644 lib/libcrucible.so $(DESTDIR)$(LIB_PREFIX)/libcrucible.so
617	README.md
@@ -1,591 +1,60 @@

BEES
====

Best-Effort Extent-Same, a btrfs deduplication agent.

About bees
----------

bees is a block-oriented userspace deduplication agent designed for
large btrfs filesystems.  It is an offline dedupe combined with an
incremental data scan capability to minimize time data spends on disk
from write to dedupe.
Bees is designed to degrade gracefully when underprovisioned with RAM.
Bees does not use more RAM or storage as filesystem data size increases.
The dedup hash table size is fixed at creation time and does not change.
The effective dedup block size is dynamic and adjusts automatically to
fit the hash table into the configured RAM limit.  Hash table overflow
is not implemented, to eliminate the IO overhead it would incur.
Hash table entries are only 16 bytes per dedup block to keep the average
dedup block size small.

Bees does not require alignment between dedup blocks or extent
boundaries (i.e. it can handle any multiple-of-4K offset between dup
block pairs).  Bees rearranges blocks into shared and unique extents
if required to work within current btrfs kernel dedup limitations.

Bees can dedup any combination of compressed and uncompressed extents.

Bees operates in a single pass which removes duplicate extents
immediately during scan.  There are no separate scanning and dedup
phases.

Bees uses only data-safe btrfs kernel operations, so it can dedup live
data (e.g. build servers, sqlite databases, VM disk images).  It does
not modify file attributes or timestamps.

Bees does not store any information about filesystem structure, so it
is not affected by the number or size of files (except to the extent
that these cause performance problems for btrfs in general).  It
retrieves such information on demand through the btrfs SEARCH_V2 and
LOGICAL_INO ioctls.  This eliminates the storage required to maintain
the equivalents of these functions in userspace.  It's also why bees
has no XFS support.

Bees is a daemon designed to run continuously and maintain its state
across crashes and reboots.  Bees uses checkpoints for persistence to
eliminate the IO overhead of a transactional data store.  On restart,
bees will dedup any data that was added to the filesystem since the
last checkpoint.

Bees is used to dedup filesystems ranging in size from 16GB to 35TB,
with hash tables ranging in size from 128MB to 11GB.
How Bees Works
--------------

Bees uses a fixed-size persistent dedup hash table with a variable
dedup block size.  Any size of hash table can be dedicated to dedup.
Bees will scale the dedup block size to fit the filesystem's unique
data size using a weighted sampling algorithm.  This allows Bees to
adapt itself to its filesystem size without forcing admins to do math
at install time.  At the same time, the duplicate block alignment
constraint can be as low as 4K, allowing efficient deduplication of
files with narrowly-aligned duplicate block offsets (e.g. compiled
binaries and VM/disk images) even if the effective block size is much
larger.

The Bees hash table is loaded into RAM at startup (using hugepages if
available), mlocked, and synced to persistent storage by
trickle-writing over a period of several hours.  This avoids issues
related to seeking or fragmentation, and enables the hash table to be
efficiently stored on btrfs with compression (or an ext4 filesystem,
or a raw disk, or on CIFS...).

Once a duplicate block is identified, Bees examines the nearby blocks
in the files where the block appears.  This allows Bees to find long
runs of adjacent duplicate block pairs if it has an entry for any one
of the blocks in its hash table.  The stored hash entry plus the block
recently scanned from disk form a duplicate pair.  On typical data
sets, this means most of the blocks in the hash table are redundant
and can be discarded without significant performance impact.

Hash table entries are grouped together into LRU lists.  As each block
is scanned, its hash table entry is inserted into the LRU list at a
random position.  If the LRU list is full, the entry at the end of the
list is deleted.  If a hash table entry is used to discover duplicate
blocks, the entry is moved to the beginning of the list.  This makes
Bees unable to detect a small number of duplicates (less than 1% on
typical filesystems), but it dramatically improves efficiency on
filesystems with many small files.  Bees has found a net 13% more
duplicate bytes than a naive fixed-block-size algorithm with a 64K
block size using the same size of hash table, even after discarding 1%
of the duplicate bytes.
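The eviction policy described above can be sketched in a few lines.
This is a toy model for illustration, not bees's actual implementation:

```python
import random

class RandomInsertLRU:
    """Toy model of the list policy described above: new entries are
    inserted at a random position, hits move to the front, and the
    tail entry is evicted when the list is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def insert(self, entry):
        if len(self.entries) >= self.capacity:
            self.entries.pop()           # evict the tail entry
        pos = random.randrange(len(self.entries) + 1)
        self.entries.insert(pos, entry)  # insert at a random position

    def hit(self, entry):
        self.entries.remove(entry)       # promote on duplicate match
        self.entries.insert(0, entry)
```

Random insertion means a brand-new entry may be evicted almost
immediately or survive a long time, while entries that actually find
duplicates accumulate at the front of the list.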
Hash Table Sizing
-----------------

Hash table entries are 16 bytes each (64-bit hash, 52-bit block number,
and some metadata bits).  Each entry represents a minimum of 4K on disk.

    unique data size    hash table size    average dedup block size
        1TB                 4GB                    4K
        1TB                 1GB                   16K
        1TB                 256MB                 64K
        1TB                 16MB                1024K
        64TB                1GB                 1024K

To change the size of the hash table, use `truncate` to change the hash
table size, delete `beescrawl.dat` so that bees will start over with a
fresh full-filesystem rescan, and restart `bees`.
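The table rows follow directly from the 16-byte entry size: the average
dedup block size is the unique data size divided by the number of hash
table entries.  A quick check, assuming binary units:

```python
def avg_dedup_block_size(unique_data_bytes, hash_table_bytes):
    """Each 16-byte hash table entry covers one dedup block, so the
    average block size is unique data divided by the entry count."""
    entries = hash_table_bytes // 16
    return unique_data_bytes // entries

TiB = 1 << 40
GiB = 1 << 30
MiB = 1 << 20

print(avg_dedup_block_size(1 * TiB, 1 * GiB) // 1024)    # 16 (K)
print(avg_dedup_block_size(1 * TiB, 256 * MiB) // 1024)  # 64 (K)
print(avg_dedup_block_size(64 * TiB, 1 * GiB) // 1024)   # 1024 (K)
```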
Things You Might Expect That Bees Doesn't Have
----------------------------------------------

* There's no configuration file (patches welcome!).  There are some
  tunables hardcoded in the source that could eventually become
  configuration options.  There's also an incomplete option parser
  (patches welcome!).

* There's no way to *stop* the Bees daemon.  Use SIGKILL, SIGTERM, or
  Ctrl-C for now.  Some of the destructors are unreachable and have
  never been tested.  Bees will repeat some work when restarted.

* The Bees process doesn't fork and writes its log to stdout/stderr.
  A shell wrapper is required to make it behave more like a daemon.

* There's no facility to exclude any part of a filesystem (patches
  welcome).

* PREALLOC extents and extents containing blocks filled with zeros will
  be replaced by holes unconditionally.

* Duplicate block groups that are less than 12K in length can take 30%
  of the run time while saving only 3% of the disk space.  There should
  be an option to just not bother with those.

* There is a lot of duplicate reading of blocks in snapshots.  Bees
  will scan all snapshots at close to the same time to try to get
  better performance by caching, but really fixing this requires
  rewriting the crawler to scan the btrfs extent tree directly instead
  of the subvol FS trees.

* Block reads are currently more allocation- and CPU-intensive than
  they should be, especially for filesystems on SSD where the IO
  overhead is much smaller.  This is a problem for power-constrained
  environments (e.g. laptops with slow CPUs).

* Bees can currently fragment extents when required to remove duplicate
  blocks, but has no defragmentation capability yet.  When possible,
  Bees will attempt to work with existing extent boundaries, but it
  will not aggregate blocks together from multiple extents to create
  larger ones.

* It is possible to resize the hash table without starting over with a
  new full-filesystem scan; however, this has not been implemented yet.
Good Btrfs Feature Interactions
-------------------------------

Bees has been tested in combination with the following:

* btrfs compression (zlib, lzo, zstd), mixtures of compressed and
  uncompressed extents
* PREALLOC extents (unconditionally replaced with holes)
* HOLE extents and the btrfs no-holes feature
* Other deduplicators and reflink copies (though Bees may decide to
  redo their work)
* btrfs snapshots and non-snapshot subvols (RW and RO)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases,
  build daemons)
* All btrfs RAID profiles (people ask about this, but it's irrelevant
  to bees)
* IO errors during dedup (read errors will throw exceptions; Bees will
  catch them and skip over the affected extent)
* Filesystems mounted *with* the flushoncommit option
* 4K filesystem data block size / clone alignment
* 64-bit and 32-bit host CPUs (amd64, x86, arm)
* Large (>16M) extents
* Huge files (>1TB--although btrfs performance on such files isn't
  great in general)
* Filesystems up to 25T bytes, 100M+ files
* btrfs receive
* btrfs nodatacow/nodatasum inode attribute or mount option (bees
  skips all nodatasum files)
* open(O_DIRECT) (seems to work as well--or as poorly--with bees as
  with any other btrfs feature)
Bad Btrfs Feature Interactions
------------------------------

Bees has been tested in combination with the following, and various
problems are known:

* bcache, lvmcache: *severe (filesystem-destroying) metadata corruption
  issues* observed in testing and reported by users, apparently only
  when used with bees.  Plain SSD and HDD seem to be OK.
* btrfs send: sometimes aborts with an I/O error when bees changes the
  data layout during a send.  The send can be restarted and will work
  if bees has finished processing the snapshot being sent.  No data
  corruption observed other than the truncated send.
* btrfs qgroups: very slow, sometimes hangs
* btrfs autodefrag mount option: hangs and high CPU usage problems
  reported by users.  bees cannot distinguish autodefrag activity from
  normal filesystem activity, and will likely try to undo the
  autodefrag, so it should probably be turned off for bees in any case.
Untested Btrfs Feature Interactions
-----------------------------------

Bees has not been tested with the following, and undesirable
interactions may occur:

* Non-4K filesystem data block size (should work if recompiled)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes
  (probably never will work)
* btrfs seed filesystems (does anyone even use those?)
* btrfs out-of-tree kernel patches (e.g. in-band dedup or encryption)
* btrfs-convert from ext2/3/4 (never tested; might run out of space or
  ignore significant portions of the filesystem due to sanity checks)
* btrfs mixed block groups (no known reason why it would *not* work,
  but never tested)
* Filesystems mounted *without* the flushoncommit option (the impact
  of crashes during dedup writes vs. ordinary writes is unknown)
Other Caveats
-------------

* btrfs balance will invalidate parts of the dedup hash table.  Bees
  will happily rebuild the table, but it will have to scan all the
  blocks again.

* btrfs defrag will cause Bees to rescan the defragmented file.  If it
  contained duplicate blocks and other references to the original
  fragmented duplicates still exist, Bees will replace the defragmented
  extents with the original fragmented ones.

* Bees creates temporary files (with O_TMPFILE) and uses them to split
  and combine extents elsewhere in btrfs.  These will take up to 2GB
  of disk space per thread during normal operation.

* Like all deduplicators, Bees will replace data blocks with metadata
  references.  It is a good idea to ensure there is sufficient
  unallocated space (see `btrfs fi usage`) on the filesystem to allow
  the metadata to multiply in size by the number of snapshots before
  running Bees for the first time.  Use

        btrfs balance start -dusage=100,limit=N /your/filesystem

  where the `limit` parameter `N` should be calculated as follows:

    * start with the current size of metadata usage (from `btrfs fi
      df`) in GB, plus 1

    * multiply by the proportion of disk space in subvols with
      snapshots (i.e. if there are no snapshots, multiply by 0;
      if all of the data is shared between at least one origin
      and one snapshot subvol, multiply by 1)

    * multiply by the number of snapshots (i.e. if there is only
      one subvol, multiply by 0; if there are 3 snapshots and one
      origin subvol, multiply by 3)

  `limit = (GB_metadata + 1) * (disk_space_in_snapshots / total_disk_space) * number_of_snapshots`

  Monitor unallocated space to ensure that the filesystem never runs
  out of metadata space (whether Bees is running or not--this is a
  general btrfs requirement).
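Worked through in code, with hypothetical numbers chosen only to
illustrate the arithmetic above:

```python
import math

def balance_limit_gb(metadata_gb, snapshot_fraction, snapshot_count):
    """Balance 'limit' heuristic from the steps above: metadata size
    plus 1 GB, scaled by the fraction of disk space under snapshots
    and the number of snapshots."""
    return math.ceil((metadata_gb + 1) * snapshot_fraction * snapshot_count)

# Example: 5 GB of metadata, half the data in snapshotted subvols,
# 3 snapshots of one origin subvol:
print(balance_limit_gb(5, 0.5, 3))  # 9
```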
A Brief List Of Btrfs Kernel Bugs
---------------------------------

Missing features (usually not available in older LTS kernels):

* 3.13: `FILE_EXTENT_SAME` ioctl added.  No way to reliably dedup with
  concurrent modifications before this.
* 3.16: `SEARCH_V2` ioctl added.  Bees could use `SEARCH` instead.
* 4.2: `FILE_EXTENT_SAME` no longer updates mtime, and can be used at
  EOF.

Future features (kernel features Bees does not yet use, but may rely
on in the future):

* 4.14: `LOGICAL_INO_V2` allows userspace to create forward and
  backward reference maps to entire physical extents with a single
  ioctl call, and raises the limit of 2730 references per extent.
  Bees has not yet been rewritten to take full advantage of these
  features.

Bug fixes (sometimes included in older LTS kernels):

* Bugs fixed prior to 4.4.107 are not listed here.
* 4.5: hang in the `INO_PATHS` ioctl used by Bees.
* 4.5: use-after-free in the `FILE_EXTENT_SAME` ioctl used by Bees.
* 4.6: lost inodes after a rename, crash, and log tree replay
  (triggered by the fsync() while writing `beescrawl.dat`).
* 4.7: the *slow backref* bug no longer triggers a softlockup panic.
  It still takes too long to resolve a block address to a
  root/inode/offset triple.
* 4.10: reduced CPU time cost of the `LOGICAL_INO` ioctl and dedup
  backref processing in general.
* 4.11: yet another dedup deadlock case is fixed.  Alas, it is not the
  last one.
* 4.14: backref performance improvements make `LOGICAL_INO` even
  faster in the worst cases (but possibly slower in the best cases?).
* 4.14.29: `WARN_ON(ref->count < 0)` in fs/btrfs/backref.c triggers
  almost once per second.  The WARN_ON is incorrect and can be removed.
Unfixed kernel bugs (as of 4.14.34) with workarounds in Bees:

* *Deadlocks* in the kernel dedup ioctl when files are modified
  immediately before dedup.  `BeesTempFile::make_copy` calls `fsync()`
  immediately before dedup to work around this.  If the `fsync()` is
  removed, the filesystem hangs within a few hours, requiring a reboot
  to recover.  Even with the `fsync()`, it is possible to lose the
  kernel race condition and encounter a deadlock within a
  machine-year.  VM image workloads may trigger this faster.  Over the
  past years several specific deadlock cases have been fixed, but at
  least one remains.

* *Bad interactions* with other Linux block layers: bcache and
  lvmcache can fail spectacularly, and apparently only while running
  bees.  This is definitely a kernel bug, either in btrfs or the lower
  block layers.  Avoid using bees with these tools, or test very
  carefully before deployment.

* *Slow backrefs* (aka toxic extents): if the number of references to
  a single shared extent within a single file grows above a few
  thousand, the kernel consumes CPU for minutes at a time while
  holding various locks that block access to the filesystem.  Bees
  avoids this bug by measuring the time the kernel spends performing
  certain operations and permanently blacklisting any extent or hash
  where the kernel starts to get slow.  Inside Bees, such blocks are
  marked as 'toxic' hash/block addresses.  Linux kernel v4.14 is
  better but can still have problems.

* `LOGICAL_INO` output is arbitrarily limited to 2730 references even
  if more buffer space is provided for results.  Once this number has
  been reached, Bees can no longer replace the extent since it can't
  find and remove all existing references.  Bees refrains from adding
  any more references after the first 2560.  Offending blocks are
  marked 'toxic' even if there is no corresponding performance
  problem.  This places an obvious limit on dedup efficiency for
  extremely common blocks or filesystems with many snapshots (although
  this limit is far greater than the effective limit imposed by the
  *slow backref* bug).  *Fixed in v4.14.*

* `LOGICAL_INO` on compressed extents returns a list of
  root/inode/offset tuples matching the extent bytenr of its argument.
  On uncompressed extents, any r/i/o tuple whose extent offset does
  not match the argument's extent offset is discarded, i.e. only the
  single 4K block matching the argument is returned, so a complete map
  of the extent references requires calling `LOGICAL_INO` for every
  single block of the extent.  This is undesirable behavior for Bees,
  which wants a list of all extent refs referencing a data extent
  (i.e. Bees wants the compressed-extent behavior in all cases).
  *Fixed in v4.14.*

* `FILE_EXTENT_SAME` is arbitrarily limited to 16MB.  This is less
  than 128MB, which is the maximum extent size that can be created by
  defrag or prealloc.  Bees avoids feedback loops this can generate
  while attempting to replace extents over 16MB in length.
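The 16MB limit means any larger replacement has to be issued as a
series of smaller dedup requests.  The splitting arithmetic looks like
this (illustrative only, not bees's actual code):

```python
def dedup_chunks(offset, length, max_chunk=16 * 1024 * 1024):
    """Split a dedup request into (offset, length) pieces no larger
    than the 16MB FILE_EXTENT_SAME limit described above."""
    chunks = []
    while length > 0:
        step = min(length, max_chunk)
        chunks.append((offset, step))
        offset += step
        length -= step
    return chunks

print(len(dedup_chunks(0, 40 * 1024 * 1024)))  # 3 chunks: 16M + 16M + 8M
```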
Not really bugs, but gotchas nonetheless:

* If a process holds a directory FD open, the subvol containing the
  directory cannot be deleted (`btrfs sub del` will start the deletion
  process, but it will not proceed past the first open directory FD).
  `btrfs-cleaner` will simply skip over the directory *and all of its
  children* until the FD is closed.  Bees avoids this gotcha by
  closing all of the FDs in its directory FD cache every 10 btrfs
  transactions.

* If a file is deleted while Bees is caching an open FD to the file,
  Bees continues to scan the file.  For very large files (e.g. VM
  images), the deletion of the file can be delayed indefinitely.  To
  limit this delay, Bees closes all FDs in its file FD cache every 10
  btrfs transactions.

* If a snapshot is deleted, bees will generate a burst of exceptions
  for references to files in the snapshot that no longer exist.  This
  lasts until the FD caches are cleared.
Installation
============

Bees can be installed by following one of these instructions:

Arch package
------------

Bees is available in the Arch Linux AUR.  Install with:

`$ pacaur -S bees-git`

Gentoo package
--------------

Bees is officially available in Gentoo Portage.  Just emerge a stable
version:

`$ emerge --ask bees`

or build a live version from git master:

`$ emerge --ask =bees-9999`

You can opt out of building the support tools with

`USE="-tools" emerge ...`

If you want to start hacking on bees and contribute changes, just
emerge the live version, which automatically pulls in all required
development packages.

Build from source
-----------------

Build with `make`.  The build produces `bin/bees` and
`lib/libcrucible.so`, which must be copied to somewhere in `$PATH` and
`$LD_LIBRARY_PATH` on the target system, respectively.

It will also generate `scripts/beesd@.service` for systemd users.
This service makes use of a helper script, `scripts/beesd`, to boot
the service.  Both of the latter use the filesystem UUID to mount the
root subvolume within a temporary runtime directory.

### Ubuntu 16.04 - 17.04:
`$ apt -y install build-essential btrfs-tools uuid-dev markdown && make`

### Ubuntu 14.04:
You can try to carry on the work done here:
https://gist.github.com/dagelf/99ee07f5638b346adb8c058ab3d57492
Strengths
---------

* Space-efficient hash table and matching algorithms - can use as
  little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Incremental realtime dedupe of new data using btrfs tree search
* Works with btrfs compression - dedupe any combination of compressed
  and uncompressed files
* Works around btrfs filesystem structure to free more disk space
* Persistent hash table for rapid restart after shutdown
* Whole-filesystem dedupe - including snapshots
* Constant hash table size - no increased RAM usage if data set
  becomes larger
* Works on live data - no scheduled downtime required
* Automatic self-throttling based on system load

Weaknesses
----------

* Whole-filesystem dedupe - has no include/exclude filters, does not
  accept file lists
* Runs continuously as a daemon - no quick start/stop
* Requires root privilege (or `CAP_SYS_ADMIN`)
* First run may require temporary disk space for extent reorganization
* [First run may increase metadata space usage if many snapshots
  exist](docs/gotchas.md)
* Constant hash table size - no decreased RAM usage if data set
  becomes smaller
* btrfs only

Packaging
---------

See 'Dependencies' below.  Package maintainers can pick ideas for
building and configuring the source package from the Gentoo ebuild:

https://github.com/gentoo/gentoo/tree/master/sys-fs/bees

You can configure some build options by creating a file `localconf`
and adjusting settings for your distribution environment there.

Please also review the Makefile for additional hints.
Installation and Usage
----------------------

* [Installation](docs/install.md)
* [Configuration](docs/config.md)
* [Running](docs/running.md)
* [Command Line Options](docs/options.md)

Recommended Reading
-------------------

* [bees Gotchas](docs/gotchas.md)
* [btrfs kernel bugs](docs/btrfs-kernel.md)
* [bees vs. other btrfs features](docs/btrfs-other.md)

More Information
----------------

Dependencies
------------

* C++11 compiler (tested with GCC 4.9, 6.2.0, 8.1.0)

  Sorry.  I really like closures and shared_ptr, so support for
  earlier compiler versions is unlikely.

* btrfs-progs (tested with 4.1..4.15.1) or libbtrfs-dev (tested with
  version 4.16.1)

  Needed for btrfs.h and ctree.h during compile.  Also needed by the
  service wrapper script.
* libuuid-dev

  This library is only required for a feature that was removed after
  v0.1.  The lingering support code can be removed.

* Linux kernel version: *minimum* 4.4.107, *4.14.29 or later
  recommended*

  Don't bother trying to make Bees work with kernel versions older
  than 4.4.107.  It may appear to work, but it won't end well: there
  are too many missing features and bugs (including data corruption
  bugs) to work around in older kernels.

  Kernel versions between 4.4.107 and 4.14.29 are usable with bees,
  but bees can trigger known performance bugs and hangs in
  dedup-related functions.

* markdown

* util-linux version that provides the `blkid` command, needed by the
  helper script `scripts/beesd`
Setup
-----

If you don't want to use the helper script `scripts/beesd` to set up
and configure bees, here's how to set up bees manually.

Create a directory for bees state files:

	export BEESHOME=/some/path
	mkdir -p "$BEESHOME"

Create an empty hash table (your choice of size, but it must be a
multiple of 16M).  This example creates a 1GB hash table:

	truncate -s 1g "$BEESHOME/beeshash.dat"
	chmod 700 "$BEESHOME/beeshash.dat"

bees can only process the root subvol of a btrfs (seriously--if the
argument is not the root subvol directory, Bees will just throw an
exception and stop).

Use a bind mount, and let only bees access it:

	UUID=3399e413-695a-4b0b-9384-1b0ef8f6c4cd
	mkdir -p /var/lib/bees/$UUID
	mount /dev/disk/by-uuid/$UUID /var/lib/bees/$UUID -osubvol=/

If you don't set BEESHOME, the path ".beeshome" will be used relative
to the root subvol of the filesystem.  For example:

	btrfs sub create /var/lib/bees/$UUID/.beeshome
	truncate -s 1g /var/lib/bees/$UUID/.beeshome/beeshash.dat
	chmod 700 /var/lib/bees/$UUID/.beeshome/beeshash.dat

You can use any relative path in BEESHOME.  The path will be taken
relative to the root of the deduped filesystem (in other words it can
be the name of a subvol):

	export BEESHOME=@my-beeshome
	btrfs sub create /var/lib/bees/$UUID/$BEESHOME
	truncate -s 1g /var/lib/bees/$UUID/$BEESHOME/beeshash.dat
	chmod 700 /var/lib/bees/$UUID/$BEESHOME/beeshash.dat
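Since the hash table must be a multiple of 16M, a size chosen for
`truncate` can be sanity-checked first.  This is a small helper
written for this document, not part of bees:

```python
def valid_beeshash_size(size_bytes):
    """The hash table must be a non-zero multiple of 16 MiB."""
    chunk = 16 * 1024 * 1024
    return size_bytes > 0 and size_bytes % chunk == 0

print(valid_beeshash_size(1 << 30))    # True: 1 GiB = 64 x 16 MiB
print(valid_beeshash_size(100 << 20))  # False: 100 MiB is not a multiple
```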
Configuration
-------------

There are some runtime configurable options using environment variables:

* BEESHOME: Directory containing bees state files:
  * beeshash.dat | persistent hash table. Must be a multiple of 16M.
    This contains 16-byte records: 8 bytes for CRC64,
    8 bytes for physical address and some metadata bits.
  * beescrawl.dat | state of SEARCH_V2 crawlers. ASCII text.
  * beesstats.txt | statistics and performance counters. ASCII text.
* BEESSTATUS: File containing a snapshot of current bees state: performance
  counters and current status of each thread. The file is meant to be
  human readable, but understanding it probably requires reading the source.
  You can watch bees run in realtime with a command like:

      watch -n1 cat $BEESSTATUS
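
For illustration, the 16-byte hash table record described above can be
modeled with Python's `struct` module. This is a sketch of the documented
sizes only; the actual field packing and bit layout are internal to bees:

```python
import struct

# 8-byte CRC64 hash plus an 8-byte field holding the physical address and
# metadata bits. The field order and endianness here are assumptions for
# illustration; only the 16-byte record size is documented.
HASH_RECORD = struct.Struct("<QQ")

record = HASH_RECORD.pack(0xDEADBEEFDEADBEEF, 0x123456789000)
print(HASH_RECORD.size)  # 16 bytes per record
```

This is why beeshash.dat sizes are always multiples of 16 bytes (and, in
practice, of 16M).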

Other options (e.g. interval between filesystem crawls) can be configured
in src/bees.h or on the cmdline (see 'Command Line Options' below).

Running
-------

Reduce CPU and IO priority to be kinder to other applications sharing
this host (or raise them for more aggressive disk space recovery). If you
use cgroups, put `bees` in its own cgroup, then reduce the `blkio.weight`
and `cpu.shares` parameters. You can also use `schedtool` and `ionice`
in the shell script that launches `bees`:

    schedtool -D -n20 $$
    ionice -c3 -p $$

Let the bees fly:

    for fs in /var/lib/bees/*-*-*-*-*/; do
        bees "$fs" >> "$fs/.beeshome/bees.log" 2>&1 &
    done

You'll probably want to arrange for the log files to be rotated
periodically. You may also want to set umask to 077 to prevent disclosure
of information about the contents of the filesystem through the log file.

There are also some shell wrappers in the `scripts/` directory.

Command Line Options
--------------------

* --thread-count (-c) COUNT
  * Specify maximum number of worker threads for scanning. Overrides
    --thread-factor (-C) and default/autodetected values.
* --thread-factor (-C) FACTOR
  * Specify ratio of worker threads to CPU cores. Overridden by --thread-count (-c).
    Default is 1.0, i.e. 1 worker thread per detected CPU. Use values
    below 1.0 to leave some cores idle, or above 1.0 if there are more
    disks than CPUs in the filesystem.
* --loadavg-target (-g) LOADAVG
  * Specify load average target for dynamic worker threads.
    Threads will be started or stopped subject to the upper limit imposed
    by thread-factor, thread-min and thread-count until the load average
    is within +/- 0.5 of LOADAVG.
* --thread-min (-G) COUNT
  * Specify minimum number of worker threads for scanning.
    Ignored unless -g option is used to specify a target load.
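
The interplay of `-c`, `-C`, and the detected CPU count can be sketched
as follows (an illustration of the override rules described above, not
the actual bees code):

```python
import os

def scan_thread_limit(thread_count=None, thread_factor=1.0):
    # --thread-count (-c) overrides --thread-factor (-C) and autodetection
    if thread_count is not None:
        return thread_count
    # otherwise, one worker per detected CPU, scaled by the factor
    ncpu = os.cpu_count() or 1
    return max(1, int(ncpu * thread_factor))

print(scan_thread_limit(thread_count=4))  # explicit count always wins: 4
```

With `--loadavg-target`, this limit becomes the upper bound within which
threads are dynamically started and stopped.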

* --scan-mode (-m) MODE
  * Specify extent scanning algorithm. Default mode is 0.
    _EXPERIMENTAL_ feature that may go away.
  * Mode 0: scan extents in ascending order of (inode, subvol, offset).
    Keeps shared extents between snapshots together. Reads files sequentially.
    Minimizes temporary space usage.
  * Mode 1: scan extents from all subvols in parallel. Good performance
    on non-spinning media when subvols are unrelated.
  * Mode 2: scan all extents from one subvol at a time. Good sequential
    read performance for spinning media. Maximizes temporary space usage.

* --timestamps (-t)
  * Enable timestamps in log output.
* --no-timestamps (-T)
  * Disable timestamps in log output.
* --absolute-paths (-p)
  * Paths in log output will be absolute.
* --strip-paths (-P)
  * Paths in log output will have the working directory at bees startup
    stripped.
* --verbose (-v)
  * Set log verbosity (0 = no output, 8 = all output, default 8).


* [How bees works](docs/how-it-works.md)
* [Missing bees features](docs/missing.md)

Bug Reports and Contributions
-----------------------------
@ -596,11 +65,9 @@ You can also use Github:

    https://github.com/Zygo/bees

Copyright & License
-------------------

Copyright 2015-2018 Zygo Blaxell <bees@furryterror.org>.

GPL (version 3 or later).

1
docs/.gitignore
vendored
Normal file
@ -0,0 +1 @@
*.html
8
docs/Makefile
Normal file
@ -0,0 +1,8 @@
MARKDOWN := $(firstword $(shell type -P markdown markdown2 markdown_py 2>/dev/null || echo markdown))
.PHONY: docs

docs: $(subst .md,.html,$(wildcard *.md)) ../README.html

%.html: %.md
	$(MARKDOWN) $< | sed -e 's/\.md/\.html/g' > $@.new
	mv -f $@.new $@
1
docs/_config.yml
Normal file
@ -0,0 +1 @@
theme: jekyll-theme-cayman
56
docs/btrfs-kernel.md
Normal file
@ -0,0 +1,56 @@
Recommended kernel version
==========================

Linux **4.14.34** or later.

A Brief List Of Btrfs Kernel Bugs
---------------------------------

Recent kernel bug fixes:

* 4.14.29: `WARN_ON(ref->count < 0)` in fs/btrfs/backref.c triggers
  almost once per second. The `WARN_ON` is incorrect, and is now removed.

Unfixed kernel bugs (as of 4.14.71):

* **Bad _filesystem destroying_ interactions** with other Linux block
  layers: `bcache` and `lvmcache` can fail spectacularly, and apparently
  only do so while running bees. This is definitely a kernel bug,
  either in btrfs or the lower block layers. **Avoid using bees with
  these tools unless your filesystem is disposable and you intend to
  debug the kernel.**

* **Compressed data corruption** is possible when using the `fallocate`
  system call to punch holes into compressed extents that contain long
  runs of zeros. The [bug results in intermittent corruption during
  reads](https://www.spinics.net/lists/linux-btrfs/msg81293.html), but
  due to the bug, the kernel might sometimes mistakenly determine data
  is duplicate, and deduplication will corrupt the data permanently.
  This bug also affects compressed `kvm` raw images with the `discard`
  feature on btrfs or any compressed file where `fallocate -d` or
  `fallocate -p` has been used.

* **Deadlock** when [simultaneously using the same files in dedupe and
  `rename`](https://www.spinics.net/lists/linux-btrfs/msg81109.html).
  There is no way for bees to reliably know when another process is
  about to rename a file while bees is deduping it. In the `rsync` case,
  bees will dedupe the new file `rsync` is creating using the old file
  `rsync` is copying from, while `rsync` will rename the new file over
  the old file to replace it.

Minor kernel problems with workarounds:

* **Slow backrefs** (aka toxic extents): If the number of references to a
  single shared extent within a single file grows above a few thousand,
  the kernel consumes CPU for minutes at a time while holding various
  locks that block access to the filesystem. bees avoids this bug
  by measuring the time the kernel spends performing `LOGICAL_INO`
  operations and permanently blacklisting any extent or hash involved
  where the kernel starts to get slow. Inside bees, such blocks are
  known as 'toxic' hash/block addresses.

* **`FILE_EXTENT_SAME` is arbitrarily limited to 16MB**. This is
  less than 128MB which is the maximum extent size that can be created
  by defrag, prealloc, or filesystems without the `compress-force`
  mount option. bees avoids feedback loops this can generate while
  attempting to replace extents over 16MB in length.
53
docs/btrfs-other.md
Normal file
@ -0,0 +1,53 @@
Good Btrfs Feature Interactions
-------------------------------

bees has been tested in combination with the following:

* btrfs compression (zlib, lzo, zstd), mixtures of compressed and uncompressed extents
* PREALLOC extents (unconditionally replaced with holes)
* HOLE extents and btrfs no-holes feature
* Other deduplicators, reflink copies (though bees may decide to redo their work)
* btrfs snapshots and non-snapshot subvols (RW and RO)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, build daemons)
* all btrfs RAID profiles
* IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
* Filesystems mounted *with* the flushoncommit option (system crashes, power failures OK)
* 4K filesystem data block size / clone alignment
* 64-bit and 32-bit host CPUs (amd64, x86, arm)
* Huge files (>1TB--although btrfs performance on such files isn't great in general)
* filesystems up to 30T+ bytes, 100M+ files
* btrfs receive
* btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
* open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)

Bad Btrfs Feature Interactions
------------------------------

bees has been tested in combination with the following, and various problems are known:

* bcache, lvmcache: **severe (filesystem-destroying) metadata corruption
  issues** observed in testing and reported by users, apparently only when
  used with bees. Plain SSD and HDD seem to be OK.
* btrfs send: some kernel versions have bugs in btrfs send that can be
  triggered by bees. The send can be restarted and will work if bees
  has finished processing the snapshot being sent. No data corruption
  observed other than the truncated send.
* btrfs qgroups: very slow, sometimes hangs...and it's even worse when
  bees is running.
* btrfs autodefrag mount option: hangs and high CPU usage problems
  reported by users. bees cannot distinguish autodefrag activity from
  normal filesystem activity and will likely try to undo the autodefrag
  if duplicate copies of the defragmented data exist.

Untested Btrfs Feature Interactions
-----------------------------------

bees has not been tested with the following, and undesirable interactions may occur:

* Non-4K filesystem data block size (should work if recompiled)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
* btrfs seed filesystems (does anyone even use those?)
* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe or encryption)
* btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
* Filesystems mounted *without* the flushoncommit option (don't know the data integrity impact of crashes during dedupe writes vs. ordinary writes)
151
docs/config.md
Normal file
@ -0,0 +1,151 @@
bees Configuration
==================

The only configuration parameter that *must* be provided is the hash
table size. Other parameters are optional or hardcoded, and the defaults
are reasonable in most cases.

Hash Table Sizing
-----------------

Hash table entries are 16 bytes per data block. The hash table stores
the most recently read unique hashes. Once the hash table is full,
each new entry in the table evicts an old entry.

Here are some numbers to estimate appropriate hash table sizes:

    unique data size | hash table size | average dedupe extent size
    1TB              | 4GB             | 4K
    1TB              | 1GB             | 16K
    1TB              | 256MB           | 64K
    1TB              | 128MB           | 128K  <- recommended
    1TB              | 16MB            | 1024K
    64TB             | 1GB             | 1024K

Notes:

* If the hash table is too large, no extra dedupe efficiency is
  obtained, and the extra space just wastes RAM. Extra space can also slow
  bees down by preventing old data from being evicted, so bees wastes time
  looking for matching data that is no longer present on the filesystem.

* If the hash table is too small, bees extrapolates from matching
  blocks to find matching adjacent blocks in the filesystem that have been
  evicted from the hash table. In other words, bees only needs to find
  one block in common between two extents in order to be able to dedupe
  the entire extents. This provides significantly more dedupe hit rate
  per hash table byte than other dedupe tools.

* When counting unique data in compressed data blocks to estimate
  optimum hash table size, count the *uncompressed* size of the data.

* Another way to approach the hash table size is to simply decide how much
  RAM can be spared without too much discomfort, give bees that amount of
  RAM, and accept whatever dedupe hit rate occurs as a result. bees will
  do the best job it can with the RAM it is given.
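
The table above follows from the entry size: at 16 bytes per entry, the
table needs roughly one surviving entry per average dedupe extent. A
sketch of that arithmetic (a planning estimate, not bees code):

```python
def hash_table_bytes(unique_data_bytes, avg_dedupe_extent_bytes):
    # one 16-byte entry must survive eviction per extent for the
    # extent to be found by a block-level hash match
    surviving_entries = unique_data_bytes // avg_dedupe_extent_bytes
    return surviving_entries * 16

# 1TB of unique data with 128K average dedupe extents -> 128MB table,
# matching the recommended row in the table above
print(hash_table_bytes(1 << 40, 128 * 1024) // (1 << 20), "MB")
```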

Factors affecting optimal hash table size
-----------------------------------------

It is difficult to predict the net effect of data layout and access
patterns on dedupe effectiveness without performing deep inspection of
both the filesystem data and its structure--a task that is as expensive
as performing the deduplication.

* **Compression** on the filesystem reduces the average extent length
  compared to uncompressed filesystems. The maximum compressed extent
  length on btrfs is 128KB, while the maximum uncompressed extent length
  is 128MB. Longer extents decrease the optimum hash table size while
  shorter extents increase the optimum hash table size because the
  probability of a hash table entry being present (i.e. unevicted) in
  each extent is proportional to the extent length.

  As a rule of thumb, the optimal hash table size for a compressed
  filesystem is 2-4x larger than the optimal hash table size for the same
  data on an uncompressed filesystem. Dedupe efficiency falls dramatically
  with hash tables smaller than 128MB/TB as the average dedupe extent size
  is larger than the largest possible compressed extent size (128KB).

* **Short writes** also shorten the average extent length and increase
  the optimum hash table size. If a database writes to files randomly using
  4K page writes, all of these extents will be 4K in length, and the hash
  table size must be increased to retain each one (or the user must accept
  a lower dedupe hit rate).

  Defragmenting files that have had many short writes increases the
  extent length and therefore reduces the optimum hash table size.

* **Time between duplicate writes** also affects the optimum hash table
  size. bees reads data blocks in logical order during its first pass,
  and after that new data blocks are read incrementally a few seconds or
  minutes after they are written. bees finds more matching blocks if there
  is a smaller amount of data between the matching reads, i.e. there are
  fewer blocks evicted from the hash table. If most identical writes to
  the filesystem occur near the same time, the optimum hash table size is
  smaller. If most identical writes occur over longer intervals of time,
  the optimum hash table size must be larger to avoid evicting hashes from
  the table before matches are found.

  For example, a build server normally writes out very similar source
  code files over and over, so it will need a smaller hash table than a
  backup server which has to refer to the oldest data on the filesystem
  every time a new client machine's data is added to the server.

Scanning modes for multiple subvols
-----------------------------------

The `--scan-mode` option affects how bees divides resources between
subvolumes. This is particularly relevant when there are snapshots,
as there are tradeoffs to be made depending on how snapshots are used
on the filesystem.

Note that if a filesystem has only one subvolume (i.e. the root,
subvol ID 5) then the `--scan-mode` option has no effect, as there is
only one subvolume to scan.

The default mode is mode 0, "lockstep". In this mode, each inode of each
subvol is scanned at the same time, before moving to the next inode in
each subvol. This maximizes the likelihood that all of the references to
a snapshot of a file are scanned at the same time, which takes advantage
of VFS caching in the Linux kernel. If snapshots are created very often,
bees will not make very good progress as it constantly restarts the
filesystem scan from the beginning each time a new snapshot is created.

Scan mode 1, "independent", simply scans every subvol independently
in parallel. Each subvol's scanner shares time equally with all other
subvol scanners. Whenever a new subvol appears, a new scanner is
created and the new subvol scanner doesn't affect the behavior of any
existing subvol scanner.

Scan mode 2, "sequential", processes each subvol completely before
proceeding to the next subvol. This is a good mode when using bees for
the first time on a filesystem that already has many existing snapshots
and a high rate of new snapshot creation. Short-lived snapshots
(e.g. those used for `btrfs send`) are effectively ignored, and bees
directs its efforts toward older subvols that are more likely to be
origin subvols for snapshots. By deduping origin subvols first, bees
ensures that future snapshots will already be deduplicated and do not
need to be deduplicated again.

If you are using bees for the first time on a filesystem with many
existing snapshots, you should read about [snapshot gotchas](gotchas.md).

Threads and load management
---------------------------

By default, bees creates one worker thread for each CPU detected.
These threads then perform scanning and dedupe operations. The number of
worker threads can be set with the [`--thread-count` and `--thread-factor`
options](options.md).

If desired, bees can automatically increase or decrease the number
of worker threads in response to system load. This reduces impact on
the rest of the system by pausing bees when other CPU and IO intensive
loads are active on the system, and resuming bees when the other loads
are inactive. This is configured with the [`--loadavg-target` and
`--thread-min` options](options.md).
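
The shape of that policy can be sketched roughly as follows, assuming the
+/- 0.5 band documented for `--loadavg-target` (a simplified model; the
real implementation differs):

```python
def adjust_workers(current_threads, loadavg, target, thread_min, thread_max):
    # Back off while the load average is above the target band,
    # ramp up while it is below, and hold steady inside the band.
    if loadavg > target + 0.5 and current_threads > thread_min:
        return current_threads - 1
    if loadavg < target - 0.5 and current_threads < thread_max:
        return current_threads + 1
    return current_threads

print(adjust_workers(4, loadavg=6.2, target=5.0, thread_min=1, thread_max=8))  # 3
```

Run periodically against the measured load average, this converges the
thread count toward the target without dropping below `--thread-min` or
exceeding the `--thread-count`/`--thread-factor` limit.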

Log verbosity
-------------

bees can be made less chatty with the [`--verbose` option](options.md).
113
docs/gotchas.md
Normal file
@ -0,0 +1,113 @@
bees Gotchas
============

Snapshots
---------

bees can dedupe filesystems with many snapshots, but bees only does
well in this situation if bees was running on the filesystem from
the beginning.

Each time bees dedupes an extent that is referenced by a snapshot,
the entire metadata page in the snapshot subvol (16KB by default) must
be CoWed in btrfs. This can result in a substantial increase in btrfs
metadata size if there are many snapshots on a filesystem.

Normally, metadata is small (less than 1% of the filesystem) and dedupe
hit rates are large (10-40% of the filesystem), so the increase in
metadata size is offset by much larger reductions in data size and the
total space used by the entire filesystem is reduced.

If a subvol is deduped _before_ a snapshot is created, the snapshot will
have the same deduplication as the subvol. This does _not_ result in
unusually large metadata sizes. If a snapshot is made after bees has
fully scanned the origin subvol, bees can avoid scanning most of the
data in the snapshot subvol, as it will be provably identical to the
origin subvol that was already scanned.

If a subvol is deduped _after_ a snapshot is created, the origin and
snapshot subvols must be deduplicated separately. In the worst case, this
will double the amount of reading the bees scanner must perform, and will
also double the amount of btrfs metadata used for the snapshot; however,
the "worst case" is a dedupe hit rate of 1% or more, so a doubling of
metadata size is certain for all but the most unique data sets. Also,
bees will not be able to free any space until the last snapshot has been
scanned and deduped, so payoff in data space savings is deferred until
the metadata has almost finished expanding.

If a subvol is deduped after _many_ snapshots have been created, all
subvols must be deduplicated individually. In the worst case, this will
multiply the scanning work and metadata size by the number of snapshots.
For 100 snapshots this can mean a 100x growth in metadata size and
bees scanning time, which typically exceeds the possible savings from
reducing the data size by dedupe. In such cases using bees will result
in a net increase in disk space usage that persists until the snapshots
are deleted.

Snapshot case studies
---------------------

* bees running on an empty filesystem
  * filesystem is mkfsed
  * bees is installed and starts running
  * data is written to the filesystem
  * bees dedupes the data as it appears
  * a snapshot is made of the data
    * The snapshot will already be 99% deduped, so the metadata will
      not expand very much because only 1% of the data in the snapshot
      must be deduped.
  * more snapshots are made of the data
    * as long as dedupe has been completed on the origin subvol,
      bees will quickly scan each new snapshot because it can skip
      all the previously scanned data. Metadata usage remains low
      (it may even shrink because there are fewer csums).

* bees installed on a non-empty filesystem with snapshots
  * filesystem is mkfsed
  * data is written to the filesystem
  * multiple snapshots are made of the data
  * bees is installed and starts running
  * bees dedupes each snapshot individually
    * The snapshot metadata will no longer be shared, resulting in
      substantial growth of metadata usage.
    * Disk space savings do not occur until bees processes the
      last snapshot reference to data.

Other Gotchas
-------------

* bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
  measuring the time required to perform `LOGICAL_INO` operations. If an
  extent requires over 10 seconds to perform a `LOGICAL_INO` then bees
  blacklists the extent and avoids referencing it in future operations.
  In most cases, fewer than 0.1% of extents in a filesystem must be
  avoided this way. This results in short write latency spikes of up
  to and a little over 10 seconds, as btrfs will not allow writes to the
  filesystem while `LOGICAL_INO` is running. Generally the CPU spends
  most of the runtime of the `LOGICAL_INO` ioctl in the kernel,
  so on a single-core CPU the entire system can freeze up for a few
  seconds at a time.
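
  The blacklisting mechanism can be sketched like this (a simplified
  model with an adjustable threshold; in bees the real check wraps the
  `LOGICAL_INO` ioctl itself):

```python
import time

def make_resolver(backref_fn, toxic_seconds=10.0):
    # Wrap a backref lookup: any address whose lookup ever exceeds the
    # threshold is blacklisted and never resolved again.
    blacklist = set()

    def resolve(addr):
        if addr in blacklist:
            return None                      # refuse known-toxic addresses
        start = time.monotonic()
        result = backref_fn(addr)
        if time.monotonic() - start > toxic_seconds:
            blacklist.add(addr)              # mark toxic for all future calls
        return result

    return resolve
```

  After one slow lookup, later calls for the same address return
  immediately without touching the kernel, which is why the latency
  spike happens at most once per toxic extent.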

* Load managers that send a `SIGSTOP` to the bees process to throttle
  CPU usage may affect the `LOGICAL_INO` timing mechanism, causing extents
  to be incorrectly labelled 'toxic'. This will cause a small reduction
  of dedupe hit rate. Slow and heavily loaded disks can trigger the same
  effect if `LOGICAL_INO` takes too long due to IO latency.

* If a process holds a directory FD open, the subvol containing the
  directory cannot be deleted (`btrfs sub del` will start the deletion
  process, but it will not proceed past the first open directory FD).
  `btrfs-cleaner` will simply skip over the directory *and all of its
  children* until the FD is closed. bees avoids this gotcha by closing
  all of the FDs in its directory FD cache every 10 btrfs transactions.

* If a file is deleted while bees is caching an open FD to the file,
  bees continues to scan the file. For very large files (e.g. VM
  images), the deletion of the file can be delayed indefinitely.
  To limit this delay, bees closes all FDs in its file FD cache every
  10 btrfs transactions.

* If a snapshot is deleted, bees will generate a burst of exceptions
  for references to files in the snapshot that no longer exist. This
  lasts until the FD caches are cleared.
100
docs/how-it-works.md
Normal file
@ -0,0 +1,100 @@
How bees Works
--------------

bees is a daemon designed to run continuously and maintain its state
across crashes and reboots.

bees uses checkpoints for persistence to eliminate the IO overhead of a
transactional data store. On restart, bees will dedupe any data that
was added to the filesystem since the last checkpoint. Checkpoints
occur every 15 minutes for scan progress, stored in `beescrawl.dat`.
The hash table trickle-writes to disk at 4GB/hour to `beeshash.dat`.
An hourly performance report is written to `beesstats.txt`. There are
no special requirements for bees hash table storage--`.beeshome` could
be stored on a different btrfs filesystem, ext4, or even CIFS.

bees uses a persistent dedupe hash table with a fixed size configured
by the user. Any size of hash table can be dedicated to dedupe. If a
fast dedupe with low hit rate is desired, bees can use a hash table as
small as 16MB.

The bees hash table is loaded into RAM at startup and `mlock`ed so it
will not be swapped out by the kernel (if swap is permitted, performance
degrades to nearly zero).

bees scans the filesystem in a single pass which removes duplicate
extents immediately after they are detected. There are no distinct
scanning and dedupe phases, so bees can start recovering free space
immediately after startup.

Once a filesystem scan has been completed, bees uses the `min_transid`
parameter of the `TREE_SEARCH_V2` ioctl to avoid rescanning old data
on future scans and quickly scan new data. An incremental data scan
can complete in less than a millisecond on an idle filesystem.

Once a duplicate data block is identified, bees examines the nearby
blocks in the files where the matched block appears. This allows bees
to find long runs of adjacent duplicate block pairs if it has an entry
for any one of the blocks in its hash table. On typical data sets,
this means most of the blocks in the hash table are redundant and can
be discarded without significant impact on dedupe hit rate.

Hash table entries are grouped together into LRU lists. As each block
is scanned, its hash table entry is inserted into the LRU list at a
random position. If the LRU list is full, the entry at the end of the
list is deleted. If a hash table entry is used to discover duplicate
blocks, the entry is moved to the beginning of the list. This makes bees
unable to detect a small number of duplicates, but it dramatically
improves efficiency on filesystems with many small files.
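
The list policy just described can be sketched as follows (a simplified
model of the documented behavior, not the bees data structure itself):

```python
import random

class RandomInsertLRU:
    """Fixed-size list: new entries land at a random position, duplicate
    hits are promoted to the front, and the tail is evicted when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def scan(self, block_hash):
        if block_hash in self.entries:
            # duplicate discovered: promote the entry to the front
            self.entries.remove(block_hash)
            self.entries.insert(0, block_hash)
            return True
        if len(self.entries) >= self.capacity:
            self.entries.pop()               # evict the entry at the end
        self.entries.insert(random.randrange(len(self.entries) + 1), block_hash)
        return False

lru = RandomInsertLRU(capacity=3)
for h in ("a", "b", "c", "d"):
    lru.scan(h)
print(len(lru.entries))  # stays at capacity: 3
```

Random insertion means a new, never-matched hash may be placed near the
tail and evicted quickly, while hashes that produce dedupe hits migrate
to the front and survive.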

Once the hash table fills up, old entries are evicted by new entries.
This means that the optimum hash table size is determined by the
distance between duplicate blocks on the filesystem rather than the
filesystem unique data size. Even if the hash table is too small
to find all duplicates, it may still find _most_ of them, especially
during incremental scans where the data in many workloads tends to be
more similar.

When a duplicate block pair is found in two btrfs extents, bees will
attempt to match all other blocks in the newer extent with blocks in
the older extent (i.e. the goal is to keep the extent referenced in the
hash table and remove the most recently scanned extent). If this is
possible, then the new extent will be replaced with a reference to the
old extent. If this is not possible, then bees will create a temporary
copy of the unmatched data in the new extent so that the entire new
extent can be removed by deduplication. This must be done because btrfs
cannot partially overwrite extents--the _entire_ extent must be replaced.
The temporary copy is then scanned during the next pass bees makes over
the filesystem for potential duplication of other extents.
|
||||||
|
|
||||||
|
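The decision above (dedupe the whole new extent if every block is covered, otherwise copy the unmatched blocks out first) can be modelled abstractly. This is a sketch of the planning logic only; real bees operates on btrfs extents through kernel ioctls, and the operation names here are invented for illustration:

```python
def plan_extent_replacement(new_extent, old_extent):
    """Return the operations needed to remove new_extent entirely,
    given that at least one of its blocks matched old_extent.
    Extents are modelled as lists of block hashes; btrfs can only
    replace whole extents, never parts of them."""
    ops = []
    unmatched = [b for b in new_extent if b not in old_extent]
    if unmatched:
        # Blocks with no duplicate get copied into a temporary extent,
        # which is itself scanned for duplicates on a later pass.
        ops.append(("copy_to_temp", unmatched))
    # Every block of the new extent is now covered by the old extent
    # or the temporary copy, so the whole extent can be deduped away.
    ops.append(("dedupe_whole_extent", "old+temp"))
    return ops

print(plan_extent_replacement(["a", "b", "x"], ["a", "b", "c"]))
```

The temporary copy step exists purely because of the whole-extent-replacement constraint; without it, a single unmatched block would keep the entire new extent alive.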
When a block containing all-zero bytes is found, bees dedupes the extent
against a temporary file containing a hole, possibly creating temporary
copies of any non-zero data in the extent for later deduplication as
described above. If the extent is compressed, bees avoids splitting
the extent in the middle as this generally has a negative impact on
compression ratio (and also triggers a [kernel bug](btrfs-kernel.md)).

bees does not store any information about filesystem structure, so
its performance is linear in the number or size of files. The hash
table stores physical block numbers which are converted into paths
and FDs on demand through btrfs `SEARCH_V2` and `LOGICAL_INO` ioctls.
This eliminates the storage required to maintain the equivalents
of these functions in userspace, at the expense of encountering [some
kernel bugs in `LOGICAL_INO` performance](btrfs-kernel.md).

bees uses only the data-safe `FILE_EXTENT_SAME` (aka `FIDEDUPERANGE`)
kernel operations to manipulate user data, so it can dedupe live data
(e.g. build servers, sqlite databases, VM disk images). It does not
modify file attributes or timestamps.

When bees has scanned all of the data, bees will pause until 10
transactions have been completed in the btrfs filesystem. bees tracks
the current btrfs transaction ID over time so that it polls less often
on quiescent filesystems and more often on busy filesystems.

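The transaction tracking amounts to an adaptive poll interval. The sketch below shows the idea; the doubling/halving factors and the bounds are assumptions, not the actual bees tuning:

```python
def next_poll_interval(prev_interval, old_transid, new_transid,
                       min_s=1.0, max_s=3600.0):
    """Poll sooner while the filesystem is committing transactions,
    and back off while it is quiescent."""
    if new_transid > old_transid:
        interval = prev_interval / 2  # busy filesystem: check again soon
    else:
        interval = prev_interval * 2  # quiescent: check less often
    return min(max(interval, min_s), max_s)

interval = 60.0
interval = next_poll_interval(interval, old_transid=100, new_transid=105)
print(interval)  # 30.0
```

On a quiescent filesystem the interval grows toward its cap, so bees wakes up rarely; any burst of new transactions pulls it back down.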
Scanning and deduplication work is performed by worker threads. If the
[`--loadavg-target` option](options.md) is used, bees adjusts the number
of worker threads up or down as required to have a user-specified load
impact on the system. The maximum and minimum number of threads is
configurable. If the system load is too high then bees will stop until
the load falls to acceptable levels.
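This load-based throttling can be sketched as a simple feedback loop. The +/- 0.5 dead band matches the `--loadavg-target` description in the options page; the one-thread-per-step adjustment is an assumption for the sketch:

```python
def adjust_workers(current, loadavg, target, min_threads, max_threads):
    """One step of the feedback loop: stop a thread when the load is
    well above target, start one when it is well below, and do nothing
    inside the +/- 0.5 dead band (per the --loadavg-target docs)."""
    if loadavg > target + 0.5:
        current -= 1
    elif loadavg < target - 0.5:
        current += 1
    return min(max(current, min_threads), max_threads)

n = 4
n = adjust_workers(n, loadavg=7.2, target=5.0, min_threads=1, max_threads=8)
print(n)  # 3
```

The dead band prevents the thread count from oscillating every time the load average crosses the target.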
docs/index.md (new file, 73 lines)
@@ -0,0 +1,73 @@

BEES
====

Best-Effort Extent-Same, a btrfs deduplication agent.

About bees
----------

bees is a block-oriented userspace deduplication agent designed for large
btrfs filesystems. It is an offline dedupe combined with an incremental
data scan capability to minimize time data spends on disk from write
to dedupe.

Strengths
---------

* Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Incremental realtime dedupe of new data using btrfs tree search
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* Works around btrfs filesystem structure to free more disk space
* Persistent hash table for rapid restart after shutdown
* Whole-filesystem dedupe - including snapshots
* Constant hash table size - no increased RAM usage if data set becomes larger
* Works on live data - no scheduled downtime required
* Automatic self-throttling based on system load

Weaknesses
----------

* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
* Runs continuously as a daemon - no quick start/stop
* Requires root privilege (or `CAP_SYS_ADMIN`)
* First run may require temporary disk space for extent reorganization
* [First run may increase metadata space usage if many snapshots exist](gotchas.md)
* Constant hash table size - no decreased RAM usage if data set becomes smaller
* btrfs only

Installation and Usage
----------------------

* [Installation](install.md)
* [Configuration](config.md)
* [Running](running.md)
* [Command Line Options](options.md)

Recommended Reading
-------------------

* [bees Gotchas](gotchas.md)
* [btrfs kernel bugs](btrfs-kernel.md)
* [bees vs. other btrfs features](btrfs-other.md)

More Information
----------------

* [How bees works](how-it-works.md)
* [Missing bees features](missing.md)

Bug Reports and Contributions
-----------------------------

Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.

You can also use Github:

    https://github.com/Zygo/bees

Copyright & License
-------------------

Copyright 2015-2018 Zygo Blaxell <bees@furryterror.org>.

GPL (version 3 or later).

docs/install.md (new file, 91 lines)
@@ -0,0 +1,91 @@

Building bees
=============

Dependencies
------------

* C++11 compiler (tested with GCC 4.9, 6.2.0, 8.1.0)

  Sorry. I really like closures and shared_ptr, so support
  for earlier compiler versions is unlikely.

* btrfs-progs (tested with 4.1..4.15.1) or libbtrfs-dev
  (tested with version 4.16.1)

  Needed for btrfs.h and ctree.h during compile.
  Also needed by the service wrapper script.

* libuuid-dev

  This library is only required for a feature that was removed after v0.1.
  The lingering support code can be removed.

* [Linux kernel version](btrfs-kernel.md) gets its own page.

* markdown for documentation

* util-linux version that provides the `blkid` command, required by the
  helper script `scripts/beesd`

Installation
============

bees can be installed by following one of these instructions:

Arch package
------------

bees is available in the Arch Linux AUR. Install with:

`$ pacaur -S bees-git`

Gentoo package
--------------

bees is officially available in Gentoo Portage. Just emerge a stable
version:

`$ emerge --ask bees`

or build a live version from git master:

`$ emerge --ask =bees-9999`

You can opt out of building the support tools with

`USE="-tools" emerge ...`

If you want to start hacking on bees and contribute changes, just emerge
the live version, which automatically pulls in all required development
packages.

Build from source
-----------------

Build with `make`. The build produces `bin/bees` and `lib/libcrucible.so`,
which must be copied to somewhere in `$PATH` and `$LD_LIBRARY_PATH`
on the target system, respectively.

It will also generate `scripts/beesd@.service` for systemd users. This
service makes use of a helper script, `scripts/beesd`, to boot the service.
Both of the latter use the filesystem UUID to mount the root subvolume
within a temporary runtime directory.

### Ubuntu 16.04 - 17.04:

`$ apt -y install build-essential btrfs-tools uuid-dev markdown && make`

### Ubuntu 14.04:

You can try to carry on the work done here: <https://gist.github.com/dagelf/99ee07f5638b346adb8c058ab3d57492>

Packaging
---------

See 'Dependencies' above. Package maintainers can pick ideas for building and
configuring the source package from the Gentoo ebuild:

<https://github.com/gentoo/gentoo/tree/master/sys-fs/bees>

You can configure some build options by creating a file `localconf` and
adjusting settings for your distribution environment there.

Please also review the Makefile for additional hints.
docs/missing.md (new file, 50 lines)
@@ -0,0 +1,50 @@

Features You Might Expect That bees Doesn't Have
------------------------------------------------

* There's no configuration file (patches welcome!). There are
  some tunables hardcoded in the source that could eventually become
  configuration options. There's also an incomplete option parser
  (patches welcome!).

* There's no way to *stop* the bees daemon. Use SIGKILL, SIGTERM, or
  Ctrl-C for now. Some of the destructors are unreachable and have never
  been tested. bees checkpoints its progress every 15 minutes (also not
  configurable, patches welcome) and will repeat some work when restarted.

* The bees process doesn't fork and writes its log to stdout/stderr.
  A shell wrapper is required to make it behave more like a daemon.

* There's no facility to exclude any part of a filesystem or focus on
  specific files (patches welcome).

* PREALLOC extents and extents containing blocks filled with zeros will
  be replaced by holes. There is no way to turn this off.

* Consecutive runs of duplicate blocks that are less than 12K in length
  can take 30% of the processing time while saving only 3% of the disk
  space. There should be an option to just not bother with those, but it's
  complicated by the btrfs requirement to always dedupe complete extents.

* There is a lot of duplicate reading of blocks in snapshots. bees will
  scan all snapshots at close to the same time to try to get better
  performance by caching, but really fixing this requires rewriting the
  crawler to scan the btrfs extent tree directly instead of the subvol
  FS trees.

* Block reads are currently more allocation- and CPU-intensive than they
  should be, especially for filesystems on SSD where the IO overhead is
  much smaller. This is a problem for CPU-power-constrained environments
  (e.g. laptops running from battery, or ARM devices with slow CPUs).

* bees can currently fragment extents when required to remove duplicate
  blocks, but has no defragmentation capability yet. When possible, bees
  will attempt to work with existing extent boundaries, but it will not
  aggregate blocks together from multiple extents to create larger ones.

* When bees fragments an extent, the copied data is compressed. There
  is currently no way (other than by modifying the source) to select a
  compression method or not compress the data (patches welcome!).

* It is theoretically possible to resize the hash table without starting
  over with a new full-filesystem scan; however, this feature has not been
  implemented yet.

docs/options.md (new file, 52 lines)
@@ -0,0 +1,52 @@

# bees Command Line Options

<table border>
<tr><th width="20%">--thread-count COUNT</th><th width="5%">-c</th>
<td>Specify maximum number of worker threads for scanning. Overrides
--thread-factor (-C) and default/autodetected values.
</td></tr>
<tr><th>--thread-factor FACTOR</th><th>-C</th>
<td>Specify ratio of worker threads to CPU cores. Overridden by --thread-count (-c).
Default is 1.0, i.e. 1 worker thread per detected CPU. Use values
below 1.0 to leave some cores idle, or above 1.0 if there are more
disks than CPUs in the filesystem.
</td></tr>

<tr><th>--loadavg-target LOADAVG</th><th>-g</th>
<td>Specify load average target for dynamic worker threads.
Threads will be started or stopped subject to the upper limit imposed
by thread-factor, thread-min and thread-count until the load average
is within +/- 0.5 of LOADAVG.
</td></tr>
<tr><th>--thread-min COUNT</th><th>-G</th>
<td>Specify minimum number of worker threads for scanning.
Ignored unless the -g option is used to specify a target load.</td></tr>

<tr><th>--scan-mode MODE</th><th>-m</th>
<td>
Specify extent scanning algorithm. Default mode is 0.
<em>EXPERIMENTAL</em> feature that may go away.
<ul>
<li> Mode 0: scan extents in ascending order of (inode, subvol, offset).
Keeps shared extents between snapshots together. Reads files sequentially.
Minimizes temporary space usage.</li>
<li> Mode 1: scan extents from all subvols in parallel. Good performance
on non-spinning media when subvols are unrelated.</li>
<li> Mode 2: scan all extents from one subvol at a time. Good sequential
read performance for spinning media. Maximizes temporary space usage.</li>
</ul>
</td></tr>

<tr><th>--timestamps</th><th>-t</th>
<td>Enable timestamps in log output.</td></tr>
<tr><th>--no-timestamps</th><th>-T</th>
<td>Disable timestamps in log output.</td></tr>
<tr><th>--absolute-paths</th><th>-p</th>
<td>Paths in log output will be absolute.</td></tr>
<tr><th>--strip-paths</th><th>-P</th>
<td>Paths in log output will have the working directory at bees startup
stripped.</td></tr>
<tr><th>--verbose</th><th>-v</th>
<td>Set log verbosity (0 = no output, 8 = all output, default 8).</td></tr>

</table>

docs/running.md (new file, 92 lines)
@@ -0,0 +1,92 @@

Running bees
============

Setup
-----

If you don't want to use the helper script `scripts/beesd` to set up and
configure bees, here's how to set it up manually.

Create a directory for bees state files:

    export BEESHOME=/some/path
    mkdir -p "$BEESHOME"

Create an empty hash table ([your choice of size](config.md), but it
must be a multiple of 16MB). This example creates a 1GB hash table:

    truncate -s 1g "$BEESHOME/beeshash.dat"
    chmod 700 "$BEESHOME/beeshash.dat"

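Since bees requires the hash table to be a positive multiple of 16 MB, a small sanity check can catch a mis-sized `beeshash.dat` before the daemon starts. This is a hypothetical helper, not shipped with bees:

```python
import os
import tempfile

def check_beeshash(path):
    """Return True if the file's size is a positive multiple of
    16 MiB, as bees requires for beeshash.dat."""
    size = os.stat(path).st_size
    return size > 0 and size % (16 * 1024 * 1024) == 0

# Demonstrate with a sparse throwaway file, the Python equivalent of
# the `truncate -s 1g` command above.
with tempfile.NamedTemporaryFile() as f:
    f.truncate(1024 * 1024 * 1024)  # 1 GiB = 64 x 16 MiB
    print(check_beeshash(f.name))   # True
```

Running a check like this at service start avoids discovering a bad table size only after bees throws an exception.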
bees can _only_ process the root subvol of a btrfs with nothing mounted
over top. If the bees argument is not the root subvol directory, bees
will just throw an exception and stop.

Use a separate mount point, and let only bees access it:

    UUID=3399e413-695a-4b0b-9384-1b0ef8f6c4cd
    mkdir -p /var/lib/bees/$UUID
    mount /dev/disk/by-uuid/$UUID /var/lib/bees/$UUID -osubvol=/

If you don't set BEESHOME, the path "`.beeshome`" will be used relative
to the root subvol of the filesystem. For example:

    btrfs sub create /var/lib/bees/$UUID/.beeshome
    truncate -s 1g /var/lib/bees/$UUID/.beeshome/beeshash.dat
    chmod 700 /var/lib/bees/$UUID/.beeshome/beeshash.dat

You can use any relative path in `BEESHOME`. The path will be taken
relative to the root of the deduped filesystem (in other words it can
be the name of a subvol):

    export BEESHOME=@my-beeshome
    btrfs sub create /var/lib/bees/$UUID/$BEESHOME
    truncate -s 1g /var/lib/bees/$UUID/$BEESHOME/beeshash.dat
    chmod 700 /var/lib/bees/$UUID/$BEESHOME/beeshash.dat

Configuration
-------------

There are some runtime configurable options using environment variables:

* BEESHOME: Directory containing bees state files:
    * beeshash.dat  | persistent hash table. Must be a multiple of 16MB, and must be created before bees starts.
    * beescrawl.dat | state of SEARCH_V2 crawlers. ASCII text. bees will create this.
    * beesstats.txt | statistics and performance counters. ASCII text. bees will create this.
* BEESSTATUS: File containing a snapshot of current bees state: performance
  counters and current status of each thread. The file is meant to be
  human readable, but understanding it probably requires reading the source.
  You can watch bees run in realtime with a command like:

      watch -n1 cat $BEESSTATUS

Other options (e.g. interval between filesystem crawls) can be configured
in `src/bees.h` or [on the command line](options.md).

Running
-------

Reduce CPU and IO priority to be kinder to other applications sharing
this host (or raise them for more aggressive disk space recovery). If you
use cgroups, put `bees` in its own cgroup, then reduce the `blkio.weight`
and `cpu.shares` parameters. You can also use `schedtool` and `ionice`
in the shell script that launches `bees`:

    schedtool -D -n20 $$
    ionice -c3 -p $$

You can also use the [`--loadavg-target` and `--thread-min`
options](options.md) to further control the impact of bees on the rest
of the system.

Let the bees fly:

    for fs in /var/lib/bees/*-*-*-*-*/; do
        bees "$fs" >> "$fs/.beeshome/bees.log" 2>&1 &
    done

You'll probably want to arrange for `/var/log/bees.log` to be rotated
periodically. You may also want to set umask to 077 to prevent disclosure
of information about the contents of the filesystem through the log file.

There are also some shell wrappers in the `scripts/` directory.