Mirror of https://github.com/Zygo/bees.git (synced 2025-05-17 21:35:45 +02:00)
README.md: answer some questions that came in after release
commit 876b76d761 (parent 74de78947d)
README.md (44 lines changed)
@@ -7,19 +7,18 @@ About Bees
 ----------
 
 Bees is a daemon designed to run continuously on live file servers.
-Bees consumes entire filesystems and deduplicates in a single pass, using
-minimal RAM to store data. Bees maintains persistent state so it can be
-interrupted and resumed, whether by planned upgrades or unplanned crashes.
-Bees makes continuous incremental progress instead of using separate
-scan and dedup phases. Bees uses the Linux kernel's `dedupe_file_range`
-system call to ensure data is handled safely even if other applications
-concurrently modify it.
+Bees scans and deduplicates whole filesystems in a single pass instead
+of separate scan and dedup phases. RAM usage does _not_ depend on
+unique data size or the number of input files. Hash tables and scan
+progress are stored persistently so the daemon can resume after a reboot.
+Bees uses the Linux kernel's `dedupe_file_range` feature to ensure data
+is handled safely even if other applications concurrently modify it.
 
 Bees is intentionally btrfs-specific for performance and capability.
-Bees uses the btrfs `SEARCH_V2` ioctl to scan for new data
-without the overhead of repeatedly walking filesystem trees with the
-POSIX API. Bees uses `LOGICAL_INO` and `INO_PATHS` to leverage btrfs's
-existing metadata instead of building its own redundant data structures.
+Bees uses the btrfs `SEARCH_V2` ioctl to scan for new data without the
+overhead of repeatedly walking filesystem trees with the POSIX API.
+Bees uses `LOGICAL_INO` and `INO_PATHS` to leverage btrfs's existing
+metadata instead of building its own redundant data structures.
 Bees can cope with Btrfs filesystem compression. Bees can reassemble
 Btrfs extents to deduplicate extents that contain a mix of duplicate
 and unique data blocks.
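For readers who want to see the `dedupe_file_range` safety property in
action outside of bees: a minimal sketch, assuming `xfs_io` from
xfsprogs (whose `dedupe` command drives the same kernel interface) and
two hypothetical files `a` and `b` whose first 128K are identical:

    # Ask the kernel to compare bytes 0..128K of file a against the same
    # range of file b, and share extents only if the contents match.
    # If another program modifies either range concurrently, the kernel
    # refuses the request instead of corrupting data.
    xfs_io -c "dedupe a 0 0 131072" b

This byte-for-byte verification inside the kernel is what lets bees run
against filesystems that are being modified at the same time.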
@@ -37,7 +36,8 @@ using a weighted sampling algorithm. This allows Bees to adapt itself
 to its filesystem size without forcing admins to do math at install time.
 At the same time, the duplicate block alignment constraint can be as low
 as 4K, allowing efficient deduplication of files with narrowly-aligned
-duplicate block offsets (e.g. compiled binaries and VM/disk images).
+duplicate block offsets (e.g. compiled binaries and VM/disk images)
+even if the effective block size is much larger.
 
 The Bees hash table is loaded into RAM at startup (using hugepages if
 available), mlocked, and synced to persistent storage by trickle-writing
@@ -78,6 +78,12 @@ and some metadata bits). Each entry represents a minimum of 4K on disk.
     1TB          16MB        1024K
    64TB           1GB        1024K
 
+It is possible to resize the hash table by changing the size of
+`beeshash.dat` (e.g. with `truncate`) and restarting `bees`. This
+does not preserve all the existing hash table entries, but it does
+preserve more than zero of them--especially if the old and new sizes
+are a power-of-two multiple of each other.
+
 Things You Might Expect That Bees Doesn't Have
 ----------------------------------------------
 
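To make the resize procedure added above concrete: a hedged sketch,
assuming bees has been stopped first and that the hash table lives in
the bees state directory (shown here as `$BEESHOME`; the exact path
depends on your installation):

    # Grow the hash table to 1GB. Picking a power-of-two multiple of
    # the old size preserves the most existing entries.
    truncate -s 1G "$BEESHOME/beeshash.dat"
    # Restart bees afterwards; it reloads the file and keeps the
    # entries that survive the resize.

Entries lost in the resize are simply repopulated as bees rescans.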
@@ -113,6 +119,16 @@ this was removed because it made Bees too aggressive to coexist with
 other applications on the same machine. It also hit the *slow backrefs*
 on N CPU cores instead of just one.
 
+* Block reads are currently more allocation- and CPU-intensive than they
+should be, especially for filesystems on SSD where the IO overhead is
+much smaller. This is a problem for power-constrained environments
+(e.g. laptops with slow CPU).
+
+* Bees can currently fragment extents when required to remove duplicate
+blocks, but has no defragmentation capability yet. When possible, Bees
+will attempt to work with existing extent boundaries, but it will not
+aggregate blocks together from multiple extents to create larger ones.
+
 Good Btrfs Feature Interactions
 -------------------------------
 
@@ -340,9 +356,9 @@ Use a bind mount, and let only bees access it:
 
 Reduce CPU and IO priority to be kinder to other applications
 sharing this host (or raise them for more aggressive disk space
-recovery). If you use cgroups, put bees in its own cgroup, then reduce
+recovery). If you use cgroups, put `bees` in its own cgroup, then reduce
 the `blkio.weight` and `cpu.shares` parameters. You can also use
-`schedtool` and `ionice in the shell script that launches bees:
+`schedtool` and `ionice` in the shell script that launches `bees`:
 
     schedtool -D -n20 $$
     ionice -c3 -p $$
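As one concrete way to apply the cgroup advice in this hunk: a sketch
assuming cgroup v1 controllers mounted under /sys/fs/cgroup (paths and
default values vary by distribution):

    # Create a dedicated cgroup for bees and lower its IO and CPU weight
    # (typical v1 defaults are blkio.weight=500 and cpu.shares=1024).
    mkdir -p /sys/fs/cgroup/blkio/bees /sys/fs/cgroup/cpu/bees
    echo 100 > /sys/fs/cgroup/blkio/bees/blkio.weight
    echo 128 > /sys/fs/cgroup/cpu/bees/cpu.shares
    # Move this shell into the cgroup before launching bees, so the
    # daemon inherits the membership.
    echo $$ > /sys/fs/cgroup/blkio/bees/tasks
    echo $$ > /sys/fs/cgroup/cpu/bees/tasks

Raising the weights instead of lowering them gives bees more of the
machine for faster disk space recovery, as the text above notes.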