docs: update documentation for new 'recent' scan mode

Also attempted to clarify the descriptions of the modes based on feedback
and questions from users over the years.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-12-12 06:33:38 +01:00 · 2021-11-15 23:43:02 -05:00
parent 03f809bf22
commit 984ceeb2a5
2 changed files with 74 additions and 36 deletions
--- a/docs/config.md
+++ b/docs/config.md
@@ -94,38 +94,76 @@ every time a new client machine's data is added to the server.
 Scanning modes for multiple subvols
 -----------------------------------
-The `--scan-mode` option affects how bees divides resources between
+The `--scan-mode` option affects how bees schedules worker threads
-subvolumes.  This is particularly relevant when there are snapshots,
+between subvolumes.  Scan modes are an experimental feature and will
-as there are tradeoffs to be made depending on how snapshots are used
+likely be deprecated in favor of a better solution.
 on the filesystem.
-Note that if a filesystem has only one subvolume (i.e. the root,
+Scan mode can be changed at any time by restarting bees with a different
-subvol ID 5) then the `--scan-mode` option has no effect, as there is
+mode option.  Scan state tracking is the same for all of the currently
-only one subvolume to scan.
+implemented modes.  The difference between the modes is the order in
 which subvols are selected.
-The default mode is mode 0, "lockstep".  In this mode, each inode of each
+If a filesystem has only one subvolume with data in it, then the
-subvol is scanned at the same time, before moving to the next inode in
+`--scan-mode` option has no effect.  In this case, there is only one
-each subvol.  This maximizes the likelihood that all of the references to
+subvolume to scan, so worker threads will all scan that one.
 a snapshot of a file are scanned at the same time, which takes advantage
 of VFS caching in the Linux kernel.  If snapshots are created very often,
 bees will not make very good progress as it constantly restarts the
 filesystem scan from the beginning each time a new snapshot is created.
-Scan mode 1, "independent", simply scans every subvol independently
+Within a subvol, there is a single optimal scan order:  files are scanned
-in parallel.  Each subvol's scanner shares time equally with all other
+in ascending numerical inode order.  Each worker will scan a different
-subvol scanners.  Whenever a new subvol appears, a new scanner is
+inode to avoid having the threads contend with each other for locks.
-created and the new subvol scanner doesn't affect the behavior of any
+File data is read sequentially and in order, but old blocks from earlier
-existing subvol scanner.
+scans are skipped.
-Scan mode 2, "sequential", processes each subvol completely before
+Between subvols, there are several scheduling algorithms with different
-proceeding to the next subvol.  This is a good mode when using bees for
+trade-offs:
-the first time on a filesystem that already has many existing snapshots
+
-and a high rate of new snapshot creation.  Short-lived snapshots
+Scan mode 0, "lockstep", scans the same inode number in each subvol at
-(e.g. those used for `btrfs send`) are effectively ignored, and bees
+close to the same time.  This is useful if the subvols are snapshots
-directs its efforts toward older subvols that are more likely to be
+with a common ancestor, since the same inode number in each subvol will
-origin subvols for snapshots.  By deduping origin subvols first, bees
+have similar or identical contents.  This maximizes the likelihood
-ensures that future snapshots will already be deduplicated and do not
+that all of the references to a snapshot of a file are scanned at
-need to be deduplicated again.
+close to the same time, improving dedupe hit rate and possibly taking
 advantage of VFS caching in the Linux kernel.  If the subvols are
 unrelated (i.e. not snapshots of a single subvol) then this mode does
 not provide significant benefit over random selection.  This mode uses
 smaller amounts of temporary space for shorter periods of time when most
 subvols are snapshots.  When a new snapshot is created, this mode will
 stop scanning other subvols and scan the new snapshot until the same
 inode number is reached in each subvol, which will effectively stop
 dedupe temporarily as this data has already been scanned and deduped
 in the other snapshots.
 Scan mode 1, "independent", scans the next inode with new data in each
 subvol.  Each subvol's scanner shares inodes uniformly with all other
 subvol scanners until the subvol has no new inodes left.  This mode makes
 continuous forward progress across the filesystem and provides average
 performance across a variety of workloads, but is slow to respond to new
 data, and may spend a lot of time deduping short-lived subvols that will
 soon be deleted when it is preferable to dedupe long-lived subvols that
 will be the origin of future snapshots.  When a new snapshot is created,
 previous subvol scans continue as before, but the time is now divided
 among one more subvol.
 Scan mode 2, "sequential", scans one subvol at a time, in numerical subvol
 ID order, processing each subvol completely before proceeding to the
 next subvol.  This avoids spending time scanning short-lived snapshots
 that will be deleted before they can be fully deduped (e.g. those used
 for `btrfs send`).  Scanning is concentrated on older subvols that are
 more likely to be origin subvols for future snapshots, eliminating the
 need to dedupe future snapshots separately.  This mode uses the largest
 amount of temporary space for the longest time, and typically requires
 a larger hash table to maintain dedupe hit rate.
 Scan mode 3, "recent", scans the subvols with the highest `min_transid`
 value first (i.e. the ones that were most recently completely scanned),
 then the highest `max_transid` (i.e. the ones that were created later),
 then falls back to "independent" mode to break ties.  This interrupts
 long scans of old subvols to give a rapid dedupe response to new data,
 then returns to the old subvols after the new data is scanned.  It is
 useful for large filesystems with multiple active subvols and rotating
 snapshots, where the first-pass scan can take months, but new duplicate
 data appears every day.
 The default scan mode is 1, "independent".
 If you are using bees for the first time on a filesystem with many
 existing snapshots, you should read about [snapshot gotchas](gotchas.md).
--- a/docs/options.md
+++ b/docs/options.md
@@ -40,16 +40,16 @@
 * `--scan-mode MODE` or `-m`
- Specify extent scanning algorithm.  Default `MODE` is 0.
+ Specify extent scanning algorithm.  Default `MODE` is 1.
 **EXPERIMENTAL** feature that may go away.
-  * Mode 0: scan extents in ascending order of (inode, subvol, offset).
+  * Mode 0: lockstep
-  Keeps shared extents between snapshots together.  Reads files sequentially.
+  * Mode 1: independent
-  Minimizes temporary space usage.
+  * Mode 2: sequential
-  * Mode 1: scan extents from all subvols in parallel.  Good performance
+  * Mode 3: recent
-  on non-spinning media when subvols are unrelated.
+
-  * Mode 2: scan all extents from one subvol at a time.  Good sequential
+ For details of the different scanning modes, see
-  read performance for spinning media.  Maximizes temporary space usage.
+ [bees configuration](config.md).
 ## Workarounds
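
Note on the subvol ordering described in the `docs/config.md` hunk: the selection rules for modes 2 and 3 can be read as sort keys. The sketch below is an illustration only, not code from bees. `SubvolState`, `recent_order`, and `sequential_order` are hypothetical names, and the final tie-break on subvol ID is a placeholder for the "independent" fallback, which the text describes but this sketch does not implement.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubvolState:
    """Hypothetical per-subvol scan state; field names follow the doc text."""
    subvol_id: int
    min_transid: int
    max_transid: int


def sequential_order(subvols: List[SubvolState]) -> List[SubvolState]:
    """Mode 2 as described: one subvol at a time, in numerical subvol ID order."""
    return sorted(subvols, key=lambda s: s.subvol_id)


def recent_order(subvols: List[SubvolState]) -> List[SubvolState]:
    """Mode 3 as described: highest min_transid first, then highest max_transid.

    The last key (subvol_id) is an assumed placeholder tie-break; the docs say
    ties fall back to "independent" mode instead.
    """
    return sorted(
        subvols,
        key=lambda s: (-s.min_transid, -s.max_transid, s.subvol_id),
    )


# Made-up example state for four subvols.
example = [
    SubvolState(subvol_id=256, min_transid=100, max_transid=200),
    SubvolState(subvol_id=257, min_transid=150, max_transid=200),
    SubvolState(subvol_id=258, min_transid=150, max_transid=210),
    SubvolState(subvol_id=259, min_transid=0, max_transid=210),
]

print([s.subvol_id for s in recent_order(example)])      # [258, 257, 256, 259]
print([s.subvol_id for s in sequential_order(example)])  # [256, 257, 258, 259]
```

In this made-up example, subvol 259 has never completed a scan (`min_transid` of 0), so "recent" places it last even though its `max_transid` is high, while "sequential" orders purely by subvol ID.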