From 74de78947dc753b333f8f49c38107120821411bc Mon Sep 17 00:00:00 2001
From: Zygo Blaxell <zblaxell@faye.furryterror.org>
Date: Wed, 16 Nov 2016 16:04:26 -0500
Subject: [PATCH] README: more docs

---
 README.md | 337 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 312 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index b5abeaf..78d3eea 100644
--- a/README.md
+++ b/README.md
@@ -1,33 +1,289 @@
 BEES
 ====
 
-Best-Effort Extent-Same, a btrfs deduplicator.
+Best-Effort Extent-Same, a btrfs deduplication daemon.
 
-TODO
-----
+About Bees
+----------
 
-Write some docs here:
+Bees is a daemon designed to run continuously on live file servers.
+Bees consumes entire filesystems and deduplicates in a single pass, using
+minimal RAM to store data.  Bees maintains persistent state so it can be
+interrupted and resumed, whether by planned upgrades or unplanned crashes.
+Bees makes continuous incremental progress instead of using separate
+scan and dedup phases.  Bees uses the Linux kernel's `dedupe_file_range`
+system call to ensure data is handled safely even if other applications
+concurrently modify it.
 
-* copyright (Zygo Blaxell 2015-2016), license (GPL3+)
-* what it is
-* what it isn't
-* building it
-* what works
-* what doesn't work
-* a brief history of btrfs kernel bugs
-* things that could have been, and why they aren't
-* roadmap (and anti-roadmap)
-* how to report bugs
-* how to contribute
+Bees is intentionally btrfs-specific for performance and capability.
+Bees uses the btrfs `SEARCH_V2` ioctl to scan for new data 
+without the overhead of repeatedly walking filesystem trees with the
+POSIX API.  Bees uses `LOGICAL_INO` and `INO_PATHS` to leverage btrfs's
+existing metadata instead of building its own redundant data structures.
+Bees can cope with Btrfs filesystem compression.  Bees can reassemble
+Btrfs extents to deduplicate extents that contain a mix of duplicate
+and unique data blocks.
+
+Bees includes a number of workarounds for Btrfs kernel bugs to (try to)
+avoid ruining your day.  You're welcome.
+
+How Bees Works
+--------------
+
+Bees uses a fixed-size persistent dedup hash table with a variable dedup
+block size.  Any size of hash table can be dedicated to dedup.  Bees will
+scale the dedup block size to fit the filesystem's unique data size
+using a weighted sampling algorithm.  This allows Bees to adapt itself
+to its filesystem size without forcing admins to do math at install time.
+At the same time, the duplicate block alignment constraint can be as low
+as 4K, allowing efficient deduplication of files with narrowly-aligned
+duplicate block offsets (e.g. compiled binaries and VM/disk images).
+
+The Bees hash table is loaded into RAM at startup (using hugepages if
+available), mlocked, and synced to persistent storage by trickle-writing
+over a period of several hours.  This avoids issues related to seeking
+or fragmentation, and enables the hash table to be efficiently stored
+on Btrfs with compression (or an ext4 filesystem, or a raw disk, or
+on CIFS...).
+
+Once a duplicate block is identified, Bees examines the nearby blocks
+in the files where block appears.  This allows Bees to find long runs
+of adjacent duplicate block pairs if it has an entry for any one of
+the blocks in its hash table.  The stored hash entry plus the block
+recently scanned from disk form a duplicate pair.  On typical data sets,
+this means most of the blocks in the hash table are redundant and can
+be discarded without significant performance impact.
+
+Hash table entries are grouped together into LRU lists.  As each block
+is scanned, its hash table entry is inserted into the LRU list at a
+random position.  If the LRU list is full, the entry at the end of the
+list is deleted.  If a hash table entry is used to discover duplicate
+blocks, the entry is moved to the beginning of the list.  This makes Bees
+unable to detect a small number of duplicates (less than 1% on typical
+filesystems), but it dramatically improves efficiency on filesystems
+with many small files.  Bees has found a net 13% more duplicate bytes
+than a naive fixed-block-size algorithm with a 64K block size using the
+same size of hash table, even after discarding 1% of the duplicate bytes.
+
+Hash Table Sizing
+-----------------
+
+Hash table entries are 16 bytes each (64-bit hash, 52-bit block number,
+and some metadata bits).  Each entry represents a minimum of 4K on disk.
+
+    unique data size    hash table size    average dedup block size
+        1TB                 4GB                  4K
+        1TB                 1GB                 16K
+        1TB               256MB                 64K
+        1TB                16MB               1024K
+       64TB                 1GB               1024K
+
+Things You Might Expect That Bees Doesn't Have
+----------------------------------------------
+
+* There's no configuration file or getopt command line option processing
+(patches welcome!).  There are some tunables hardcoded in the source
+that could eventually become configuration options.
+
+* There's no way to *stop* the Bees daemon.  Use SIGKILL, SIGTERM, or
+Ctrl-C for now.  Some of the destructors are unreachable and have never
+been tested.  Bees will repeat some work when restarted.
+
+* The Bees process doesn't fork and writes its log to stdout/stderr.
+A shell wrapper is required to make it behave more like a daemon.
+
+* There's no facility to exclude any part of a filesystem (patches
+welcome).
+
+* PREALLOC extents and extents containing blocks filled with zeros will
+be replaced by holes unconditionally.
+
+* Duplicate block groups that are less than 12K in length can take 30%
+of the run time while saving only 3% of the disk space.  There should
+be an option to just not bother with those.
+
+* There is a lot of duplicate reading of blocks in snapshots.  Bees will
+scan all snapshots at close to the same time to try to get better
+performance by caching, but really fixing this requires rewriting the
+crawler to scan the btrfs extent tree directly instead of the subvol
+FS trees.
+
+* Bees had support for multiple worker threads in the past; however,
+this was removed because it made Bees too aggressive to coexist with
+other applications on the same machine.  It also hit the *slow backrefs*
+on N CPU cores instead of just one.
+
+Good Btrfs Feature Interactions
+-------------------------------
+
+Bees has been tested in combination with the following:
+
+* btrfs compression (either method), mixtures of compressed and uncompressed extents
+* PREALLOC extents (unconditionally replaced with holes)
+* HOLE extents and btrfs no-holes feature
+* Other deduplicators, reflink copies (though Bees may decide to redo their work)
+* btrfs snapshots and non-snapshot subvols (RW only)
+* Concurrent file modification (e.g. PostgreSQL and sqlite databases, build daemons)
+* all btrfs RAID profiles (people ask about this, but it's irrelevant)
+* IO errors during dedup (read errors will throw exceptions, Bees will catch them and skip over the affected extent)
+* Filesystems mounted *with* the flushoncommit option
+* 4K filesystem data block size / clone alignment
+* 64-bit CPUs (amd64)
+* Large (>16M) extents
+* Huge files (>1TB--although Btrfs performance on such files isn't great in general)
+* filesystems up to 25T bytes, 100M+ files
+
+
+Bad Btrfs Feature Interactions
+------------------------------
+
+Bees has not been tested with the following, and undesirable interactions may occur:
+
+* Non-4K filesystem data block size (should work if recompiled)
+* 32-bit CPUs (x86, arm)
+* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (probably never will work)
+* btrfs read-only snapshots (never tested, probably wouldn't work well)
+* btrfs send/receive (receive is probably OK, but send requires RO snapshots.  See above)
+* btrfs qgroups (never tested, no idea what might happen)
+* btrfs seed filesystems (does anyone even use those?)
+* btrfs autodefrag mount option (never tested, could fight with Bees)
+* btrfs nodatacow mount option or inode attribute (*could* work, but might not)
+* btrfs out-of-tree kernel patches (e.g. in-band dedup or encryption)
+* btrfs-convert from ext2/3/4 (never tested)
+* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
+* open(O_DIRECT)
+* Filesystems mounted *without* the flushoncommit option
+
+Other Caveats
+-------------
+
+* btrfs balance will invalidate parts of the dedup table.  Bees will
+  happily rebuild the table, but it will have to scan all the blocks
+  again.
+
+* btrfs defrag will cause Bees to rescan the defragmented file.  If it
+  contained duplicate blocks and other references to the original
+  fragmented duplicates still exist, Bees will replace the defragmented
+  extents with the original fragmented ones.
+
+* Bees creates temporary files (with O_TMPFILE) and uses them to split
+  and combine extents elsewhere in btrfs.  These will take up to 2GB
+  during normal operation.
+
+* Like all deduplicators, Bees will replace data blocks with metadata
+  references.  It is a good idea to ensure there are several GB of
+  unallocated space (see `btrfs fi df`) on the filesystem before running
+  Bees for the first time.  Use
+
+	btrfs balance start -dusage=100,limit=1 /your/filesystem
+
+  If possible, raise the `limit` parameter to the current size of metadata
+  usage (from `btrfs fi df`) plus 1.
+
+
+A Brief List Of Btrfs Kernel Bugs
+---------------------------------
+
+Fixed bugs:
+
+* 3.13: `FILE_EXTENT_SAME` ioctl added.  No way to reliably dedup with
+  concurrent modifications before this.
+* 3.16: `SEARCH_V2` ioctl added.  Bees could use `SEARCH` instead.
+* 4.2: `FILE_EXTENT_SAME` no longer updates mtime, can be used at EOF.
+  Kernel deadlock bugs fixed.
+* 4.7: *slow backref* bug no longer triggers a softlockup panic.  It still
+  too long to resolve a block address to a root/inode/offset triple.
+
+Unfixed kernel bugs (as of 4.5.7) with workarounds in Bees:
+
+* *slow backref*: If the number of references to a single shared extent
+  within a single file grows above a few thousand, the kernel consumes CPU
+  for up to 40 uninterruptible minutes while holding various locks that
+  block access to the filesystem.  Bees avoids this bug by measuring the
+  time the kernel spends performing certain operations and permanently
+  blacklisting any extent or hash where the kernel starts to get slow.
+  Inside Bees, such blocks are marked as 'toxic' hash/block addresses.
+
+* `LOGICAL_INO` output is arbitrarily limited to 2730 references
+  even if more buffer space is provided for results.  Once this number
+  has been reached, Bees can no longer replace the extent since it can't
+  find and remove all existing references.  Bees refrains from adding
+  any more references after the first 2560.  Offending blocks are
+  marked 'toxic' even if there is no corresponding performance problem.
+  This places an obvious limit on dedup efficiency for extremely common
+  blocks or filesystems with many snapshots (although this limit is
+  far greater than the effective limit imposed by the *slow backref* bug).
+
+* `FILE_EXTENT_SAME` is arbitrarily limited to 16MB.  This is less than
+  128MB which is the maximum extent size that can be created by defrag
+  or prealloc.  Bees avoids feedback loops this can generate while
+  attempting to replace extents over 16MB in length.
+
+* `DEFRAG_RANGE` is useless.  The ioctl attempts to implement `btrfs
+  fi defrag` in the kernel, and will arbitrarily defragment more or
+  less than the range requested to match the behavior expected from the
+  userspace tool.  Bees implements its own defrag instead, copying data
+  to a temporary file and using the `FILE_EXTENT_SAME` ioctl to replace
+  precisely the specified range of offending fragmented blocks.
+
+* When writing BeesStringFile, a crash can cause the directory entry
+  `beescrawl.UUID.dat.tmp` to exist without a corresponding inode.
+  This directory entry cannot be renamed or removed; however, it does
+  not prevent the creation of a second directory entry with the same
+  name that functions normally, so it doesn't prevent Bees operation.
+
+  The orphan directory entry can be removed by deleting its subvol,
+  so place BEESHOME on a separate subvol so you can delete these orphan
+  directory entries when they occur (or use btrfs zero-log before mounting
+  the filesystem after a crash).
+
+* If the fsync() BeesTempFile::make_copy is removed, the filesystem
+  hangs within a few hours, requiring a reboot to recover.
+
+Not really a bug, but a gotcha nonetheless:
+
+* If a process holds a directory FD open, the subvol containing the
+  directory cannot be deleted (`btrfs sub del` will start the deletion
+  process, but it will not proceed past the first open directory FD).
+  `btrfs-cleaner` will simply skip over the directory *and all of its
+  children* until the FD is closed.  Bees avoids this gotcha by closing
+  all of the FDs in its directory FD cache every 15 minutes.
+
+
+
+Requirements
+------------
+
+* C++11 compiler (tested with GCC 4.9)
+
+  Sorry.  I really like closures.
+
+* btrfs-progs (tested with 4.1..4.7)
+
+  Needed for btrfs.h and ctree.h during compile.
+  Not needed at runtime.
+
+* libuuid-dev
+
+  TODO: remove the one function used from this library.
+  It supports a feature Bees no longer implements.
+
+* Linux kernel 4.2 or later
+
+  Don't bother trying to make Bees work with older kernels.
+  It won't end well.
+
+* 64-bit host and target CPU
+
+  This code has never been tested on a 32-bit target CPU.
+
+  A 64-bit host CPU may be required for the self-tests.
+  Some of the ioctls don't work properly with a 64-bit
+  kernel and 32-bit userspace.
 
 Build
 -----
 
-Requirements:
-	* C++11 compiler (I use GCC 4.9)
-	* btrfs-progs (I've used 4.1..4.7) for /usr/include/btrfs/*
-	* libuuid-dev (TODO: remove the one function we call from this library)
-
 Build with `make`.
 
 The build produces `bin/bees` and `lib/libcrucible.so`, which must be
@@ -66,23 +322,54 @@ in src/bees.h.
 Running
 -------
 
-We created this directory in the previous section.
+We created this directory in the previous section:
 
 	export BEESHOME=/some/path
 
-Use a tmpfs for BEESSTATUS, it updates once per second
+Use a tmpfs for BEESSTATUS, it updates once per second:
 
 	export BEESSTATUS=/run/bees.status
 
-bees can only process the root subvol of a btrfs.
-Use a bind mount, and let only bees access it.
+bees can only process the root subvol of a btrfs (seriously--if the
+argument is not the root subvol directory, Bees will just throw an
+exception and stop).
+
+Use a bind mount, and let only bees access it:
 
 	mount -osubvol=/ /dev/<your-filesystem> /var/lib/bees/root
 
-Let the bees fly!
+Reduce CPU and IO priority to be kinder to other applications
+sharing this host (or raise them for more aggressive disk space
+recovery).  If you use cgroups, put bees in its own cgroup, then reduce
+the `blkio.weight` and `cpu.shares` parameters.  You can also use
+`schedtool` and `ionice in the shell script that launches bees:
+
+	schedtool -D -n20 $$
+	ionice -c3 -p $$
+
+Let the bees fly:
 
 	bees /var/lib/bees/root >> /var/log/bees.log 2>&1
 
 You'll probably want to arrange for /var/log/bees.log to be rotated
 periodically.  You may also want to set umask to 077 to prevent disclosure
 of information about the contents of the filesystem through the log file.
+
+
+Bug Reports and Contributions
+-----------------------------
+
+Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.
+
+You can also use Github:
+
+	https://github.com/Zygo/bees
+
+
+
+Copyright & License
+===================
+
+Copyright 2015-2016 Zygo Blaxell <bees@furryterror.org>.
+
+GPL (version 3 or later).