readahead: flush the readahead cache based on time, not extent count

If the extent wasn't read in the last second, chances are high that it was evicted from the page cache. If the extents have been evicted from the cache by the time we grow or dedupe them, we'll take a serious performance hit as we read them back in, one page at a time. Use a 5-second delay to match the default writeback interval.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
readahead: ignore large and unproductive readahead requests
2026-01-08 20:00:22 +01:00 · 2025-07-22 00:06:11 -04:00 · 2025-07-21 21:21:54 -04:00 · 2025-07-21 21:21:54 -04:00 · 2025-07-21 21:21:54 -04:00 · 2025-07-21 21:21:54 -04:00
53 changed files with 3863 additions and 1235 deletions
@@ -4,6 +4,7 @@ define TEMPLATE_COMPILER =
 sed $< >$@ \
 		-e's#@DESTDIR@#$(DESTDIR)#' \
 		-e's#@PREFIX@#$(PREFIX)#' \
 		-e's#@BINDIR@#$(BINDIR)#' \
 		-e's#@ETC_PREFIX@#$(ETC_PREFIX)#' \
 		-e's#@LIBEXEC_PREFIX@#$(LIBEXEC_PREFIX)#'
 endef
@@ -1,6 +1,7 @@
 PREFIX ?= /usr
 ETC_PREFIX ?= /etc
 LIBDIR ?= lib
 BINDIR ?= sbin
 LIB_PREFIX ?= $(PREFIX)/$(LIBDIR)
 LIBEXEC_PREFIX ?= $(LIB_PREFIX)/bees
@@ -55,7 +56,7 @@ install_bees: src $(RUN_INSTALL_TESTS)
 install_scripts: ## Install scripts
 install_scripts: scripts
-	install -Dm755 scripts/beesd $(DESTDIR)$(PREFIX)/sbin/beesd
+	install -Dm755 scripts/beesd $(DESTDIR)$(PREFIX)/$(BINDIR)/beesd
 	install -Dm644 scripts/beesd.conf.sample $(DESTDIR)$(ETC_PREFIX)/bees/beesd.conf.sample
 ifneq ($(SYSTEMD_SYSTEM_UNIT_DIR),)
 	install -Dm644 scripts/beesd@.service $(DESTDIR)$(SYSTEMD_SYSTEM_UNIT_DIR)/beesd@.service
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------
-bees is a block-oriented userspace deduplication agent designed for large
+bees is a block-oriented userspace deduplication agent designed to scale
-btrfs filesystems.  It is an offline dedupe combined with an incremental
+up to large btrfs filesystems.  It is an offline dedupe combined with
-data scan capability to minimize time data spends on disk from write
+an incremental data scan capability to minimize time data spends on disk
-to dedupe.
+from write to dedupe.
 Strengths
 ---------
- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Daemon mode - incrementally dedupes new data as it appears
 * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
 * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
 * btrfs support - recovers more free space from btrfs than naive dedupers
 Weaknesses
 ----------
 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * First run may require temporary disk space for extent reorganization
 * [First run may increase metadata space usage if many snapshots exist](docs/gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------
 * [bees Gotchas](docs/gotchas.md)
- * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](docs/btrfs-other.md)
 * [What to do when something goes wrong](docs/wrong.md)
@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------
-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
 GPL (version 3 or later).
@@ -1,31 +1,24 @@
-Recommended Kernel Version for bees
+Recommended Linux Kernel Version for bees
-===================================
+=========================================
-First, a warning that is not specific to bees:
+First, a warning about old Linux kernel versions:
-> **Kernel 5.1, 5.2, and 5.3 should not be used with btrfs due to a
+> **Linux kernel version 5.1, 5.2, and 5.3 should not be used with btrfs
-severe regression that can lead to fatal metadata corruption.**
+due to a severe regression that can lead to fatal metadata corruption.**
-This issue is fixed in kernel 5.4.14 and later.
+This issue is fixed in version 5.4.14 and later.
-**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
+**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
-6.0, or 6.1, with recent LTS and -stable updates.**  The latest released
+6.6, or 6.12 with recent LTS and -stable updates.**  The latest released
-kernel as of this writing is 6.4.1.
+kernel as of this writing is 6.12.9, and the earliest supported LTS
 kernel is 5.4.
-4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
+Some optional bees features use kernel APIs introduced in kernel 4.15
-issues.  Older kernels will be slower (a little slower or a lot slower
+(extent scan) and 5.6 (`openat2` support).  These bees features are not
-depending on which issues are triggered).  Not all fixes are backported.
+available on older kernels.  Support for older kernels may be removed
-
+in a future bees release.
 Obsolete non-LTS kernels have a variety of unfixed issues and should
 not be used with btrfs.  For details see the table below.
 bees requires btrfs kernel API version 4.2 or higher, and does not work
 at all on older kernels.
 Some bees features rely on kernel 4.15 to work, and these features will
 not be available on older kernels.  Currently, bees is still usable on
 older kernels with degraded performance or with options disabled, but
 support for older kernels may be removed.
 bees will not run at all on kernels before 4.2 due to lack of minimal
 API support.
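
The kernel thresholds described above (4.2 as the minimum, 4.15 for extent scan, 5.6 for `openat2`) can be sketched as a small check. This is only an illustration: `parse_kernel_release` and `bees_feature_support` are hypothetical names, not part of bees, and the check compares mainline version numbers only, so it ignores any fixes or features a distro kernel may have backported.

```python
import re

# Thresholds taken from the documentation text; the constant names are
# hypothetical and exist only for this sketch.
MIN_KERNEL = (4, 2)            # bees does not run at all before this
EXTENT_SCAN_KERNEL = (4, 15)   # extent scan mode
OPENAT2_KERNEL = (5, 6)        # openat2 support


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '6.12.9-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def bees_feature_support(release):
    """Report which of the features named in the text a kernel version allows."""
    version = parse_kernel_release(release)
    return {
        "runs": version >= MIN_KERNEL,
        "extent_scan": version >= EXTENT_SCAN_KERNEL,
        "openat2": version >= OPENAT2_KERNEL,
    }
```

On a running system the input could come from `platform.release()` in the Python standard library, for example `bees_feature_support(platform.release())`.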
@@ -62,14 +55,17 @@ These bugs are particularly popular among bees users, though not all are specifi
 | 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
 | - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
 | - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
 | 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount.  Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
 | 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
 | - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
 | - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
 | 5.18 | 5.19 | parent transid verify failed during log tree replay after a crash during a rename operation | 5.18.18, 5.19.2, 6.0 and later | 723df2bcc9e1 btrfs: join running log transaction when logging new name
 | 5.12 | 6.0 | space cache corruption and potential double allocations | 5.15.65, 5.19.6, 6.0 and later | ced8ecf026fd btrfs: fix space cache corruption and potential double allocations
 | 6.0 | 6.5 | suboptimal allocation in multi-device filesystems due to chunk allocator regression | 6.1.60, 6.5.9, 6.6 and later | 8a540e990d7d btrfs: fix stripe length calculation for non-zoned data chunk allocation
 | 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later.  Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
 | 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
-| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that
+| 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
 | 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that
 "Last bad kernel" refers to that version's last stable update from
 kernel.org.  Distro kernels may backport additional fixes.  Consult
@@ -95,12 +91,12 @@ contains the last committed component of the fix.
 Workarounds for known kernel bugs
 ---------------------------------
-* **Hangs with concurrent `LOGICAL_INO` and dedupe**:  on all
+* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**:  on all
-  kernel versions so far, multiple threads running `LOGICAL_INO`
+  kernel versions so far, multiple threads running `LOGICAL_INO` and
-  and dedupe ioctls at the same time on the same inodes or extents
+  dedupe/clone ioctls at the same time on the same inodes or extents
  can lead to a kernel hang.  The kernel enters an infinite loop in
  `add_all_parents`, where `count` is 0, `ref->count` is 1, and
-  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref).
+  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.
  bees has two workarounds for this bug: 1. schedule work so that multiple
  threads do not simultaneously access the same inode or the same extent,
@@ -121,58 +117,32 @@ Workarounds for known kernel bugs
  It is still theoretically possible to trigger the kernel bug when
  running bees at the same time as other dedupers, or other programs
-  that use `LOGICAL_INO` like `btdu`; however, it's extremely difficult
+  that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
-  to reproduce the bug without closely cooperating threads.
+  operation such as `cp` or `mv`; however, it's extremely difficult to
  reproduce the bug without closely cooperating threads.
-* **Slow backrefs** (aka toxic extents):  Under certain conditions,
+* **Slow backrefs** (aka toxic extents):  On older kernels, under certain
-  if the number of references to a single shared extent grows too
+  conditions, if the number of references to a single shared extent grows
-  high, the kernel consumes more and more CPU while also holding locks
+  too high, the kernel consumes more and more CPU while also holding
-  that delay write access to the filesystem.  bees avoids this bug
+  locks that delay write access to the filesystem.  This is no longer
-  by measuring the time the kernel spends performing `LOGICAL_INO`
+  a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
-  operations and permanently blacklisting any extent or hash involved
+  but there are still some remains of earlier workarounds for this issue
-  where the kernel starts to get slow.  In the bees log, such blocks
+  in bees that have not been fully removed.
  are labelled as 'toxic' hash/block addresses.  Toxic extents are
  rare (about 1 in 100,000 extents become toxic), but toxic extents can
  become 8 orders of magnitude more expensive to process than the fastest
  non-toxic extents.  This seems to affect all dedupe agents on btrfs;
  at the time of writing only bees has a workaround for this bug.
-  This workaround is less necessary for kernels 5.4.96, 5.7 and later,
+  bees avoided this bug by measuring the time the kernel spends performing
-  though the bees workaround can still be triggered on newer kernels
+  `LOGICAL_INO` operations and permanently blacklisting any extent or
-  by changes in btrfs since kernel version 5.1.
+  hash involved where the kernel starts to get slow.  In the bees log,
  such blocks are labelled as 'toxic' hash/block addresses.
  Future bees releases will remove toxic extent detection (it only detects
  false positives now) and clear all previously saved toxic extent bits.
 * **dedupe breaks `btrfs send` in old kernels**.  The bees option
  `--workaround-btrfs-send` prevents any modification of read-only subvols
-  in order to avoid breaking `btrfs send`.
+  in order to avoid breaking `btrfs send` on kernels before 5.2.
-  This workaround is no longer necessary to avoid kernel crashes
+  This workaround is no longer necessary to avoid kernel crashes and
-  and send performance failure on kernel 4.9.207, 4.14.159, 4.19.90,
+  send performance failure on kernel 5.4.4 and later.  bees will pause
-  5.3.17, 5.4.4, 5.5 and later; however, some conflict between send
+  dedupe until the send is finished on current kernels.
  and dedupe still remains, so the workaround is still useful.
  `btrfs receive` is not and has never been affected by this issue.
 Unfixed kernel bugs
 -------------------
 * **The kernel does not permit `btrfs send` and dedupe to run at the
  same time**.  Recent kernels no longer crash, but now refuse one
  operation with an error if the other operation was already running.
  bees has not been updated to handle the new dedupe behavior optimally.
  Optimal behavior is to defer dedupe operations when send is detected,
  and resume after the send is finished.  Current bees behavior is to
  complain loudly about each individual dedupe failure in log messages,
  and abandon duplicate data references in the snapshot that send is
  processing.  A future bees version shall have better handling for
  this situation.
  Workaround:  send `SIGSTOP` to bees, or terminate the bees process,
  before running `btrfs send`.
  This workaround is not strictly required if the snapshot is deleted after
  sending.  In that case, any duplicate data blocks that were not removed
  by dedupe will be removed by snapshot delete instead.  The workaround
  still saves some IO.
  `btrfs receive` is not affected by this issue.
@@ -3,40 +3,34 @@ Good Btrfs Feature Interactions
 bees has been tested in combination with the following:
-* btrfs compression (zlib, lzo, zstd), mixtures of compressed and uncompressed extents
+* btrfs compression (zlib, lzo, zstd)
 * PREALLOC extents (unconditionally replaced with holes)
 * HOLE extents and btrfs no-holes feature
-* Other deduplicators, reflink copies (though bees may decide to redo their work)
+* Other deduplicators (`duperemove`, `jdupes`)
-* btrfs snapshots and non-snapshot subvols (RW and RO)
+* Reflink copies (modern coreutils `cp` and `mv`)
 * Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
-* All btrfs RAID profiles
+* All btrfs RAID profiles:  single, dup, raid0, raid1, raid10, raid1c3, raid1c4, raid5, raid6
-* IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
+* IO errors during dedupe (affected extents are skipped)
 * Filesystems mounted with or without the `flushoncommit` option
 * 4K filesystem data block size / clone alignment
 * 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
 * Large files (kernel 5.4 or later strongly recommended)
-* Filesystems up to 90T+ bytes, 1000M+ files
+* Filesystem data sizes up to 100T+ bytes, 1000M+ files
 * `open(O_DIRECT)` (seems to work as well--or as poorly--with bees as with any other btrfs feature)
 * btrfs-convert from ext2/3/4
 * btrfs `autodefrag` mount option
 * btrfs balance (data balances cause rescan of relocated data)
 * btrfs block-group-tree
 * btrfs `flushoncommit` and `noflushoncommit` mount options
 * btrfs mixed block groups
 * btrfs `nodatacow`/`nodatasum` inode attribute or mount option (bees skips all nodatasum files)
 * btrfs qgroups and quota support (_not_ squotas)
 * btrfs receive
-* btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
+* btrfs scrub
-* open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)
+* btrfs send (dedupe pauses automatically, kernel 5.4 or later required)
-* lvm dm-cache, writecache
+* btrfs snapshot, non-snapshot subvols (RW and RO), snapshot delete
-Bad Btrfs Feature Interactions
+**Note:** some btrfs features have minimum kernel versions which are
------------------------------
+higher than the minimum kernel version for bees.
 bees has been tested in combination with the following, and various problems are known:
 * btrfs send:  there are bugs in `btrfs send` that can be triggered by
  bees on old kernels.  The [`--workaround-btrfs-send` option](options.md)
  works around this issue by preventing bees from modifying read-only
  snapshots.
 * btrfs qgroups:  very slow, sometimes hangs...and it's even worse when
  bees is running.
 * btrfs autodefrag mount option:  bees cannot distinguish autodefrag
  activity from normal filesystem activity, and may try to undo the
  autodefrag if duplicate copies of the defragmented data exist.
 Untested Btrfs Feature Interactions
 -----------------------------------
@@ -45,10 +39,6 @@ bees has not been tested with the following, and undesirable interactions may oc
 * Non-4K filesystem data block size (should work if recompiled)
 * Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
-* btrfs seed filesystems (no particular reason it wouldn't work, but no one has reported trying)
+* btrfs seed filesystems, raid-stripe-tree, squotas (no particular reason these wouldn't work, but no one has reported trying)
-* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe, encryption, extent tree v2)
+* btrfs out-of-tree kernel patches (e.g. encryption, extent tree v2)
 * btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
 * btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
 * Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
 * bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
 * flashcache: an out-of-tree cache-HDD-on-SSD block layer helper
@@ -26,11 +26,7 @@ Here are some numbers to estimate appropriate hash table sizes:
 Notes:
 * If the hash table is too large, no extra dedupe efficiency is
-obtained, and the extra space wastes RAM.  If the hash table contains
+obtained, and the extra space wastes RAM.
 more block records than there are blocks in the filesystem, the extra
 space can slow bees down.  A table that is too large prevents obsolete
 data from being evicted, so bees wastes time looking for matching data
 that is no longer present on the filesystem.
 * If the hash table is too small, bees extrapolates from matching
 blocks to find matching adjacent blocks in the filesystem that have been
@@ -59,19 +55,19 @@ patterns on dedupe effectiveness without performing deep inspection of
 both the filesystem data and its structure--a task that is as expensive
 as performing the deduplication.
-* **Compression** on the filesystem reduces the average extent length
+* **Compression** in files reduces the average extent length compared
-compared to uncompressed filesystems.  The maximum compressed extent
+to uncompressed files.  The maximum compressed extent length on
-length on btrfs is 128KB, while the maximum uncompressed extent length
+btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
-is 128MB.  Longer extents decrease the optimum hash table size while
+Longer extents decrease the optimum hash table size while shorter extents
-shorter extents increase the optimum hash table size because the
+increase the optimum hash table size, because the probability of a hash
-probability of a hash table entry being present (i.e. unevicted) in
+table entry being present (i.e. unevicted) in each extent is proportional
-each extent is proportional to the extent length.
+to the extent length.
   As a rule of thumb, the optimal hash table size for a compressed
 filesystem is 2-4x larger than the optimal hash table size for the same
-data on an uncompressed filesystem.  Dedupe efficiency falls dramatically
+data on an uncompressed filesystem.  Dedupe efficiency falls rapidly with
-with hash tables smaller than 128MB/TB as the average dedupe extent size
+hash tables smaller than 128MB/TB as the average dedupe extent size is
-is larger than the largest possible compressed extent size (128KB).
+larger than the largest possible compressed extent size (128KB).
 * **Short writes or fragmentation** also shorten the average extent
 length and increase optimum hash table size.  If a database writes to
@@ -98,27 +94,70 @@ code files over and over, so it will need a smaller hash table than a
 backup server which has to refer to the oldest data on the filesystem
 every time a new client machine's data is added to the server.
-Scanning modes for multiple subvols
+Scanning modes
-----------------------------------
+--------------
-The `--scan-mode` option affects how bees schedules worker threads
+The `--scan-mode` option affects how bees iterates over the filesystem,
-between subvolumes.  Scan modes are an experimental feature and will
+schedules extents for scanning, and tracks progress.
 likely be deprecated in favor of a better solution.
-Scan mode can be changed at any time by restarting bees with a different
+There are now two kinds of scan mode:  the legacy **subvol** scan modes,
-mode option.  Scan state tracking is the same for all of the currently
+and the new **extent** scan mode.
 implemented modes.  The difference between the modes is the order in
 which subvols are selected.
-If a filesystem has only one subvolume with data in it, then the
+Scan mode can be changed by restarting bees with a different scan mode
-`--scan-mode` option has no effect.  In this case, there is only one
+option.
 subvolume to scan, so worker threads will all scan that one.
-Within a subvol, there is a single optimal scan order:  files are scanned
+Extent scan mode:
-in ascending numerical inode order.  Each worker will scan a different
+
-inode to avoid having the threads contend with each other for locks.
+ * Works with 4.15 and later kernels.
-File data is read sequentially and in order, but old blocks from earlier
+ * Can estimate progress and provide an ETA.
-scans are skipped.
+ * Can optimize scanning order to dedupe large extents first.
 * Can keep up with frequent creation and deletion of snapshots.
 Subvol scan modes:
 * Work with 4.14 and earlier kernels.
 * Cannot estimate or report progress.
 * Cannot optimize scanning order by extent size.
 * Have problems keeping up with multiple snapshots created during a scan.
 The default scan mode is 4, "extent".
 If you are using bees for the first time on a filesystem with many
 existing snapshots, you should read about [snapshot gotchas](gotchas.md).
 Subvol scan modes
 -----------------
 Subvol scan modes are maintained for compatibility with existing
 installations, but will not be developed further.  New installations
 should use extent scan mode instead.
 The _quantity_ of text below detailing the shortcomings of each subvol
 scan mode should be informative all by itself.
 Subvol scan modes work on any kernel version supported by bees.  They
 are the only scan modes usable on kernel 4.14 and earlier.
 The difference between the subvol scan modes is the order in which the
 files from different subvols are fed into the scanner.  They all scan
 files in inode number order, from low to high offset within each inode,
 the same way that a program like `cat` would read files (but skipping
 over old data from earlier btrfs transactions).
 If a filesystem has only one subvolume with data in it, then all of
 the subvol scan modes are equivalent.  In this case, there is only one
 subvolume to scan, so every possible ordering of subvols is the same.
 The `--workaround-btrfs-send` option pauses scanning subvols that are
 read-only.  If the subvol is made read-write (e.g. with `btrfs prop set
 $subvol ro false`), or if the `--workaround-btrfs-send` option is removed,
 then the scan of that subvol is unpaused and dedupe proceeds normally.
 Space will only be recovered when the last read-only subvol is deleted.
 Subvol scan modes cannot efficiently or accurately calculate an ETA for
 completion or estimate progress through the data.  They simply request
 "the next new inode" from btrfs, and they are completed when btrfs says
 there is no next new inode.
 Between subvols, there are several scheduling algorithms with different
 trade-offs:
@@ -126,68 +165,151 @@ trade-offs:
 Scan mode 0, "lockstep", scans the same inode number in each subvol at
 close to the same time.  This is useful if the subvols are snapshots
 with a common ancestor, since the same inode number in each subvol will
-have similar or identical contents.  This maximizes the likelihood
+have similar or identical contents.  This maximizes the likelihood that
-that all of the references to a snapshot of a file are scanned at
+all of the references to a snapshot of a file are scanned at close to
-close to the same time, improving dedupe hit rate and possibly taking
+the same time, improving dedupe hit rate.  If the subvols are unrelated
-advantage of VFS caching in the Linux kernel.  If the subvols are
+(i.e. not snapshots of a single subvol) then this mode does not provide
-unrelated (i.e. not snapshots of a single subvol) then this mode does
+any significant advantage.  This mode uses smaller amounts of temporary
-not provide significant benefit over random selection.  This mode uses
+space for shorter periods of time when most subvols are snapshots.  When a
-smaller amounts of temporary space for shorter periods of time when most
+new snapshot is created, this mode will stop scanning other subvols and
-subvols are snapshots.  When a new snapshot is created, this mode will
+scan the new snapshot until the same inode number is reached in each
-stop scanning other subvols and scan the new snapshot until the same
+subvol, which will effectively stop dedupe temporarily as this data has
-inode number is reached in each subvol, which will effectively stop
+already been scanned and deduped in the other snapshots.
 dedupe temporarily as this data has already been scanned and deduped
 in the other snapshots.
-Scan mode 1, "independent", scans the next inode with new data in each
+Scan mode 1, "independent", scans the next inode with new data in
-subvol.  Each subvol's scanner shares inodes uniformly with all other
+each subvol.  There is no coordination between the subvols, other than
-subvol scanners until the subvol has no new inodes left.  This mode makes
+round-robin distribution of files from each subvol to each worker thread.
-continuous forward progress across the filesystem and provides average
+This mode makes continuous forward progress in all subvols.  When a new
-performance across a variety of workloads, but is slow to respond to new
+snapshot is created, previous subvol scans continue as before, but the
-data, and may spend a lot of time deduping short-lived subvols that will
+worker threads are now divided among one more subvol.
 soon be deleted when it is preferable to dedupe long-lived subvols that
 will be the origin of future snapshots.  When a new snapshot is created,
 previous subvol scans continue as before, but the time is now divided
 among one more subvol.
 Scan mode 2, "sequential", scans one subvol at a time, in numerical subvol
-ID order, processing each subvol completely before proceeding to the
+ID order, processing each subvol completely before proceeding to the next
-next subvol.  This avoids spending time scanning short-lived snapshots
+subvol.  This avoids spending time scanning short-lived snapshots that
-that will be deleted before they can be fully deduped (e.g. those used
+will be deleted before they can be fully deduped (e.g. those used for
-for `btrfs send`).  Scanning is concentrated on older subvols that are
+`btrfs send`).  Scanning starts on older subvols that are more likely
-more likely to be origin subvols for future snapshots, eliminating the
+to be origin subvols for future snapshots, eliminating the need to
-need to dedupe future snapshots separately.  This mode uses the largest
+dedupe future snapshots separately.  This mode uses the largest amount
-amount of temporary space for the longest time, and typically requires
+of temporary space for the longest time, and typically requires a larger
-a larger hash table to maintain dedupe hit rate.
+hash table to maintain dedupe hit rate.
 Scan mode 3, "recent", scans the subvols with the highest `min_transid`
 value first (i.e. the ones that were most recently completely scanned),
 then falls back to "independent" mode to break ties.  This interrupts
-long scans of old subvols to give a rapid dedupe response to new data,
+long scans of old subvols to give a rapid dedupe response to new data
-then returns to the old subvols after the new data is scanned.  It is
+in previously scanned subvols, then returns to the old subvols after
-useful for large filesystems with multiple active subvols and rotating
+the new data is scanned.
 snapshots, where the first-pass scan can take months, but new duplicate
 data appears every day.
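
The ordering differences between modes 0, 1, and 2 can be illustrated with a toy model. This is a sketch under strong assumptions, not how bees schedules work: the real scanner runs incrementally across worker threads, while the hypothetical functions below simply sort a fixed set of `(subvol, inode)` pairs. The input is assumed to be a dict mapping each subvol ID to the inode numbers that have new data. Mode 3 is omitted because it depends on per-subvol `min_transid` state that this model does not track.

```python
from itertools import zip_longest


def lockstep_order(subvols):
    """Mode 0: visit the same inode number in every subvol at about the same time."""
    pairs = [(inode, subvol) for subvol, inodes in subvols.items() for inode in inodes]
    return [(subvol, inode) for inode, subvol in sorted(pairs)]


def independent_order(subvols):
    """Mode 1: take the next inode from each subvol in turn (round-robin)."""
    columns = [
        [(subvol, inode) for inode in sorted(inodes)]
        for subvol, inodes in sorted(subvols.items())
    ]
    return [item for row in zip_longest(*columns) for item in row if item is not None]


def sequential_order(subvols):
    """Mode 2: finish each subvol, in subvol ID order, before starting the next."""
    return [
        (subvol, inode)
        for subvol in sorted(subvols)
        for inode in sorted(subvols[subvol])
    ]
```

With `{5: [257, 300], 256: [258, 259]}` as input, the three functions produce three different orders: lockstep follows inode numbers across subvols, independent alternates between subvols, and sequential exhausts subvol 5 before touching subvol 256.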
-The default scan mode is 1, "independent".
+Extent scan mode
 ----------------
-If you are using bees for the first time on a filesystem with many
+Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
-existing snapshots, you should read about [snapshot gotchas](gotchas.md).
+Extent scan mode reads each extent once, regardless of the number of
 reflinks or snapshots.  It adapts to the creation of new snapshots
 and reflinks immediately, without having to revisit old data.
 In the extent scan mode, extents are separated into multiple size tiers
 to prioritize large extents over small ones.  Deduping large extents
 keeps the metadata update cost low per block saved, resulting in faster
 dedupe at the start of a scan cycle.  This is important for maximizing
 performance in use cases where bees runs for a limited time, such as
 during an overnight maintenance window.
 Once the larger size tiers are completed, dedupe space recovery speeds
 slow down significantly.  It may be desirable to stop bees running once
 the larger size tiers are finished, then start bees running some time
 later after new data has appeared.
 Each extent is mapped in physical address order, and all extent references
 are submitted to the scanner at the same time, resulting in much better
 cache behavior and dedupe performance compared to the subvol scan modes.
 The "extent" scan mode is not usable on kernels before 4.15 because
 it relies on the `LOGICAL_INO_V2` ioctl added in that kernel release.
 When using bees with an older kernel, only subvol scan modes will work.
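The kernel requirement above can be sketched as a version comparison. This is a minimal illustration, not code from bees: the helper names `kernel_supports_extent_scan` and `running_kernel_supports_extent_scan` are hypothetical, and a version check is only a heuristic, since distribution kernels may backport features.

```cpp
#include <cstdio>
#include <sys/utsname.h>

// Hypothetical helper (not part of bees): true when the given kernel version
// is 4.15 or later, the release that the text says added LOGICAL_INO_V2.
constexpr bool kernel_supports_extent_scan(unsigned major, unsigned minor)
{
	return major > 4 || (major == 4 && minor >= 15);
}

// Applies the same check to the running kernel by parsing the release
// string reported by uname(2), e.g. "5.10.0-28-amd64".
bool running_kernel_supports_extent_scan()
{
	struct utsname u{};
	unsigned major = 0, minor = 0;
	if (uname(&u) != 0 || std::sscanf(u.release, "%u.%u", &major, &minor) != 2) {
		return false;	// unknown kernel: assume only subvol scan modes work
	}
	return kernel_supports_extent_scan(major, minor);
}
```

A launcher script or wrapper could use such a check to fall back to one of the subvol scan modes on older kernels.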
 Extents are divided into virtual subvols by size, using reserved btrfs
 subvol IDs 250..255.  The size tier groups are:
 * 250: 32M+1 and larger
 * 251: 8M+1..32M
 * 252: 2M+1..8M
 * 253: 512K+1..2M
 * 254: 128K+1..512K
 * 255: 128K and smaller (includes all compressed extents)
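The tier boundaries listed above can be expressed as a small lookup function. This is a sketch under stated assumptions, not the bees implementation: the name `size_tier_subvol` and the `compressed` flag are hypothetical, and `K` and `M` are taken to mean binary units (KiB and MiB).

```cpp
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;

// Hypothetical helper (not taken from the bees source): maps an extent's
// length in bytes to the virtual subvol ID of its size tier.
constexpr std::uint64_t size_tier_subvol(std::uint64_t extent_bytes, bool compressed = false)
{
	if (compressed)               return 255;	// all compressed extents share the smallest tier
	if (extent_bytes >  32 * MiB) return 250;	// 32M+1 and larger
	if (extent_bytes >   8 * MiB) return 251;	// 8M+1..32M
	if (extent_bytes >   2 * MiB) return 252;	// 2M+1..8M
	if (extent_bytes > 512 * KiB) return 253;	// 512K+1..2M
	if (extent_bytes > 128 * KiB) return 254;	// 128K+1..512K
	return 255;					// 128K and smaller
}
```

Each boundary value belongs to the smaller tier, so an extent of exactly 32 MiB maps to 251 and one byte more maps to 250.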
 Extent scan mode can efficiently calculate dedupe progress within
 the filesystem and estimate an ETA for completion within each size
 tier; however, the accuracy of the ETA can be questionable due to the
 non-uniform distribution of block addresses in a typical user filesystem.
 Older versions of bees do not recognize the virtual subvols, so running
 an old bees version after running a new bees version will reset the
 "extent" scan mode's progress in `beescrawl.dat` to the beginning.
 This may change in future bees releases, i.e. extent scans will store
 their checkpoint data somewhere else.
 The `--workaround-btrfs-send` option behaves differently in extent
 scan modes:  In extent scan mode, dedupe proceeds on all subvols that are
 read-write, but all subvols that are read-only are excluded from dedupe.
 Space will only be recovered when the last read-only subvol is deleted.
 During `btrfs send` all duplicate extents in the sent subvol will not be
 removed (the kernel will reject dedupe commands while send is active,
 and bees currently will not re-issue them after the send is complete).
 It may be preferable to terminate the bees process while running `btrfs
 send` in extent scan mode, and restart bees after the `send` is complete.
 Threads and load management
 ---------------------------
-By default, bees creates one worker thread for each CPU detected.
+By default, bees creates one worker thread for each CPU detected.  These
-These threads then perform scanning and dedupe operations.  The number of
+threads then perform scanning and dedupe operations.  bees attempts to
-worker threads can be set with the [`--thread-count` and `--thread-factor`
+maximize the amount of productive work each thread does, until either the
-options](options.md).
+threads are all continuously busy, or there is no remaining work to do.
-If desired, bees can automatically increase or decrease the number
+In many cases it is not desirable to continually run bees at maximum
-of worker threads in response to system load.  This reduces impact on
+performance.  Maximum performance is not necessary if bees can dedupe
-the rest of the system by pausing bees when other CPU and IO intensive
+new data faster than it appears on the filesystem.  If it only takes
-loads are active on the system, and resumes bees when the other loads
+bees 10 minutes per day to dedupe all new data on a filesystem, then
-are inactive.  This is configured with the [`--loadavg-target` and
+bees doesn't need to run for more than 10 minutes per day.
-`--thread-min` options](options.md).
+
 bees supports a number of options for reducing system load:
 * Run bees for a few hours per day, at an off-peak time (i.e. during
 a maintenance window), instead of running bees continuously.  Any data
 added to the filesystem while bees is not running will be scanned when
 bees restarts.  At the end of the maintenance window, terminate the
 bees process with SIGTERM to write the hash table and scan position
 for the next maintenance window.
 * Temporarily pause bees operation by sending the bees process SIGUSR1,
 and resume operation with SIGUSR2.  This is preferable to freezing
 and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
 signals, because it allows bees to close open file handles that would
 otherwise prevent those files from being deleted while bees is frozen.
 * Reduce the number of worker threads with the [`--thread-count` or
 `--thread-factor` options](options.md).  This simply leaves CPU cores
 idle so that other applications on the host can use them, or to save
 power.
 * Allow bees to automatically track system load and increase or decrease
 the number of threads to reach a target system load.  This reduces
 impact on the rest of the system by pausing bees when other CPU and IO
 intensive loads are active on the system, and resumes bees when the other
 loads are inactive.  This is configured with the [`--loadavg-target`
 and `--thread-min` options](options.md).
 * Allow bees to self-throttle operations that enqueue delayed work
 within btrfs.  These operations are not well controlled by Linux
 features such as process priority or IO priority or IO rate-limiting,
 because the enqueued work is submitted to btrfs several seconds before
 btrfs performs the work.  By the time btrfs performs the work, it's too
 late for external throttling to be effective.  The [`--throttle-factor`
 option](options.md) tracks how long it takes btrfs to complete queued
 operations, and reduces bees's queued work submission rate to match
 btrfs's queued work completion rate (or a fraction thereof, to reduce
 system load).
 Log verbosity
 -------------
@@ -120,10 +120,14 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
 * `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
 * `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
- * `crawl_create`: A new subvol crawler was created.
+ * `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
- * `crawl_done`: One pass over all subvols on the filesystem was completed.
+ * `crawl_done`: One pass over a subvol was completed.
 * `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
 * `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
 * `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
 * `crawl_extent`: The extent crawler queued all references to an extent for processing.
 * `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
 * `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
 * `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
 * `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
 * `crawl_hole`: An extent item in the search results refers to a hole.
@@ -135,8 +139,13 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
 * `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
 * `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
 * `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
 * `crawl_skip`: Small extent items were skipped because no extent of sufficient size was found within the minimum search distance.
 * `crawl_skip_ms`: Time spent skipping small extent items.
 * `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
 * `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
 * `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
 * `crawl_unknown`: An extent item in the search results has an unrecognized type.
 * `crawl_unthrottled`: Extent scan allowed to create work queue items again.
 dedup
 -----
@@ -162,6 +171,25 @@ The `exception` event group consists of C++ exceptions.  C++ exceptions are thro
 * `exception_caught`: Total number of C++ exceptions thrown and caught by a generic exception handler.
 * `exception_caught_silent`: Total number of "silent" C++ exceptions thrown and caught by a generic exception handler.  These are exceptions which are part of the correct and normal operation of bees.  The exceptions are logged at a lower log level.
 extent
 ------
 The `extent` event group consists of events that occur within the extent scanner.
 * `extent_deferred_inode`: A lock conflict was detected when two worker threads attempted to manipulate the same inode at the same time.
 * `extent_empty`: A complete list of references to an extent was created but the list was empty, e.g. because all refs are in deleted inodes or snapshots.
 * `extent_fail`: An ioctl call to `LOGICAL_INO` failed.
 * `extent_forward`: An extent reference was submitted for scanning.
 * `extent_mapped`: A complete map of references to an extent was created and added to the crawl queue.
 * `extent_ok`: An ioctl call to `LOGICAL_INO` completed successfully.
 * `extent_overflow`: A complete map of references to an extent exceeded `BEES_MAX_EXTENT_REF_COUNT`, so the extent was dropped.
 * `extent_ref_missing`: An extent reference reported by `LOGICAL_INO` was not found by later `TREE_SEARCH_V2` calls.
 * `extent_ref_ok`: One extent reference was queued for scanning.
 * `extent_restart`: An extent reference was requeued to be scanned again after an active extent lock is released.
 * `extent_retry`: An extent reference was requeued to be scanned again after an active inode lock is released.
 * `extent_skip`: A 4K extent with more than 1000 refs was skipped.
 * `extent_zero`: An ioctl call to `LOGICAL_INO` succeeded, but reported an empty list of extents.
 hash
 ----
@@ -180,24 +208,6 @@ The `hash` event group consists of operations related to the bees hash table.
 * `hash_insert`: A `(hash, address)` pair was inserted by `BeesHashTable::push_random_hash_addr`.
 * `hash_lookup`: The hash table was searched for `(hash, address)` pairs matching a given `hash`.
 inserted
 --------
 The `inserted` event group consists of operations related to storing hash and address data in the hash table (i.e. the hash table client).
 * `inserted_block`: Total number of data block references scanned and inserted into the hash table.
 * `inserted_clobbered`: Total number of data block references scanned and eliminated from the filesystem.
 matched
 -------
 The `matched` event group consists of events related to matching incoming data blocks against existing hash table entries.
 * `matched_0`: A data block was scanned, hash table entries found, but no matching data blocks on the filesystem located.
 * `matched_1_or_more`: A data block was scanned, hash table entries found, and one or more matching data blocks on the filesystem located.
 * `matched_2_or_more`: A data block was scanned, hash table entries found, and two or more matching data blocks on the filesystem located.
 * `matched_3_or_more`: A data block was scanned, hash table entries found, and three or more matching data blocks on the filesystem located.
 open
 ----
@@ -259,12 +269,29 @@ The `pairforward` event group consists of events related to extending matching b
 * `pairforward_try`: Started extending a pair of matching block ranges forward.
 * `pairforward_zero`: A pair of matching block ranges could not be extended forward by one block because the src block contained all zeros and was not compressed.
 progress
 --------
 The `progress` event group consists of events related to progress estimation.
 * `progress_no_data_bg`: Failed to retrieve any data block groups from the filesystem.
 * `progress_not_created`: A crawler for one size tier had not been created for the extent scanner.
 * `progress_complete`: A crawler for one size tier has completed a scan.
 * `progress_not_found`: The extent position for a crawler does not correspond to any block group.
 * `progress_out_of_bg`: The extent position for a crawler does not correspond to any data block group.
 * `progress_ok`: Table of progress and ETA created successfully.
 readahead
 ---------
-The `readahead` event group consists of events related to calls to `posix_fadvise`.
+The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).
- * `readahead_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_WILLNEED)` aka `readahead()`.
+ * `readahead_bytes`: Number of bytes prefetched.
 * `readahead_count`: Number of read calls.
 * `readahead_clear`: Number of times the duplicate read cache was cleared.
 * `readahead_fail`: Number of read errors during prefetch.
 * `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
 * `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
 * `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.
 replacedst
@@ -301,7 +328,7 @@ The `resolve` event group consists of operations related to translating a btrfs
 * `resolve_large`: The `LOGICAL_INO` ioctl returned more than 2730 results (the limit of the v1 ioctl).
 * `resolve_ms`: Total time spent in the `LOGICAL_INO` ioctl (i.e. wallclock time, not kernel CPU time).
 * `resolve_ok`: The `LOGICAL_INO` ioctl returned success.
- * `resolve_overflow`: The `LOGICAL_INO` ioctl returned more than 655050 extents (the limit of the v2 ioctl).
+ * `resolve_overflow`: The `LOGICAL_INO` ioctl returned 9999 or more extents (the limit configured in `bees.h`).
 * `resolve_toxic`: The `LOGICAL_INO` ioctl took more than 0.1 seconds of kernel CPU time.
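The `resolve_toxic` test compares kernel CPU time against a threshold. One way to take such a measurement on Linux is sketched below; it is an assumption that `getrusage(RUSAGE_THREAD)` is a suitable clock here, and the names `to_seconds`, `thread_kernel_cpu_seconds` and `is_toxic` are hypothetical rather than bees functions. The threshold is left as a parameter.

```cpp
#include <stdexcept>
#include <sys/resource.h>

// Converts a (seconds, microseconds) pair to fractional seconds.
constexpr double to_seconds(long sec, long usec)
{
	return sec + usec / 1e6;
}

// Kernel (system) CPU time consumed so far by the calling thread.
// RUSAGE_THREAD is Linux-specific.
double thread_kernel_cpu_seconds()
{
	struct rusage ru{};
	if (getrusage(RUSAGE_THREAD, &ru) != 0) {
		throw std::runtime_error("getrusage failed");
	}
	return to_seconds(ru.ru_stime.tv_sec, ru.ru_stime.tv_usec);
}

// Runs op() and reports whether it used more kernel CPU time than the
// threshold.  op stands in for the LOGICAL_INO ioctl call.
template <class Operation>
bool is_toxic(Operation &&op, double threshold_seconds)
{
	const double before = thread_kernel_cpu_seconds();
	op();
	return thread_kernel_cpu_seconds() - before > threshold_seconds;
}
```

Measuring kernel CPU time rather than wallclock time keeps the test from misfiring when the thread is merely waiting on IO or on other processes.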
 root
@@ -329,35 +356,38 @@ The `scan` event group consists of operations related to scanning incoming data.
 * `scan_blacklisted`: A blacklisted extent was passed to `scan_forward` and dropped.
 * `scan_block`: A block of data was scanned.
- * `scan_bump`: After deduping a block range, the scan pointer had to be moved past the end of the deduped byte range.
+ * `scan_compressed_no_dedup`: An extent that was compressed contained non-zero, non-duplicate data.
- * `scan_dup_block`: Number of duplicate blocks deduped.
+ * `scan_dup_block`: Number of duplicate block references deduped.
- * `scan_dup_hit`: A pair of duplicate block ranges was found and removed.
+ * `scan_dup_hit`: A pair of duplicate block ranges was found.
 * `scan_dup_miss`: A pair of duplicate blocks was found in the hash table but not in the filesystem.
 * `scan_eof`: Scan past EOF was attempted.
 * `scan_erase_redundant`: Blocks in the hash table were removed because they were removed from the filesystem by dedupe.
 * `scan_extent`: An extent was scanned (`scan_one_extent`).
 * `scan_extent_tiny`: An extent below 128K that was not the beginning or end of a file was scanned.  No action is currently taken for these--they are merely counted.
 * `scan_forward`: A logical byte range was scanned (`scan_forward`).
 * `scan_found`: An entry was found in the hash table matching a scanned block from the filesystem.
 * `scan_hash_hit`: A block was found on the filesystem corresponding to a block found in the hash table.
 * `scan_hash_miss`: A block was not found on the filesystem corresponding to a block found in the hash table.
- * `scan_hash_preinsert`: A block was prepared for insertion into the hash table.
+ * `scan_hash_preinsert`: A non-zero data block's hash was prepared for possible insertion into the hash table.
 * `scan_hash_insert`: A non-zero data block's hash was inserted into the hash table.
 * `scan_hole`: A hole extent was found during scan and ignored.
 * `scan_interesting`: An extent had flags that were not recognized by bees and was ignored.
 * `scan_lookup`: A hash was looked up in the hash table.
 * `scan_malign`: A block being scanned matched a hash at EOF in the hash table, but the EOF was not aligned to a block boundary and the two blocks did not have the same length.
 * `scan_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
 * `scan_no_rewrite`: All blocks in an extent were removed by dedupe (i.e. no copies).
 * `scan_push_front`: An entry in the hash table matched a duplicate block, so the entry was moved to the head of its LRU list.
 * `scan_reinsert`: A copied block's hash and block address was inserted into the hash table.
 * `scan_resolve_hit`: A block address in the hash table was successfully resolved to an open FD and offset pair.
 * `scan_resolve_zero`: A block address in the hash table was not resolved to any subvol/inode pair, so the corresponding hash table entry was removed.
 * `scan_rewrite`: A range of bytes in a file was copied, then the copy deduped over the original data.
 * `scan_root_dead`: A deleted subvol was detected.
 * `scan_seen_clear`: The list of recently scanned extents reached maximum size and was cleared.
 * `scan_seen_erase`: An extent reference was modified by scan, so all future references to the extent must be scanned.
 * `scan_seen_hit`: A scan was skipped because the same extent had recently been scanned.
 * `scan_seen_insert`: An extent reference was not modified by scan and its hashes have been inserted into the hash table, so all future references to the extent can be ignored.
 * `scan_seen_miss`: A scan was not skipped because the same extent had not recently been scanned (i.e. the extent was scanned normally).
 * `scan_skip_bytes`: Nuisance dedupe or hole-punching would save less than half of the data in an extent.
 * `scan_skip_ops`: Nuisance dedupe or hole-punching would require too many dedupe/copy/hole-punch operations in an extent.
 * `scan_toxic_hash`: A scanned block has the same hash as a hash table entry that is marked toxic.
 * `scan_toxic_match`: A hash table entry points to a block that is discovered to be toxic.
 * `scan_twice`: Two references to the same block have been found in the hash table.
- * `scan_zero_compressed`: An extent that was compressed and contained only zero bytes was found.
+ * `scan_zero`: A data block containing only zero bytes was detected.
 * `scan_zero_uncompressed`: A block that contained only zero bytes was found in an uncompressed extent.
 scanf
 -----
@@ -365,9 +395,10 @@ scanf
 The `scanf` event group consists of operations related to `BeesContext::scan_forward`.  This is the entry point where `crawl` schedules new data for scanning.
 * `scanf_deferred_extent`: Two tasks attempted to scan the same extent at the same time, so one was deferred.
- * `scanf_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
+ * `scanf_eof`: Scan past EOF was attempted.
 * `scanf_extent`: A btrfs extent item was scanned.
 * `scanf_extent_ms`: Total thread-seconds spent scanning btrfs extent items.
 * `scanf_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
 * `scanf_total`: A logical byte range of a file was scanned.
 * `scanf_total_ms`: Total thread-seconds spent scanning logical byte ranges.
@@ -205,7 +205,7 @@ Other Gotchas
 * bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
  measuring the time required to perform `LOGICAL_INO` operations.
-  If an extent requires over 0.1 kernel CPU seconds to perform a
+  If an extent requires over 5.0 kernel CPU seconds to perform a
  `LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
  referencing it in future operations.  In most cases, fewer than 0.1%
  of extents in a filesystem must be avoided this way.  This results
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------
-bees is a block-oriented userspace deduplication agent designed for large
+bees is a block-oriented userspace deduplication agent designed to scale
-btrfs filesystems.  It is an offline dedupe combined with an incremental
+up to large btrfs filesystems.  It is an offline dedupe combined with
-data scan capability to minimize time data spends on disk from write
+an incremental data scan capability to minimize time data spends on disk
-to dedupe.
+from write to dedupe.
 Strengths
 ---------
- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Daemon mode - incrementally dedupes new data as it appears
 * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
 * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
 * btrfs support - recovers more free space from btrfs than naive dedupers
 Weaknesses
 ----------
 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * First run may require temporary disk space for extent reorganization
 * [First run may increase metadata space usage if many snapshots exist](gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------
 * [bees Gotchas](gotchas.md)
- * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](btrfs-other.md)
 * [What to do when something goes wrong](wrong.md)
@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------
-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
 GPL (version 3 or later).
@@ -15,16 +15,9 @@ specific files (patches welcome).
 * PREALLOC extents and extents containing blocks filled with zeros will
 be replaced by holes.  There is no way to turn this off.
-* Consecutive runs of duplicate blocks that are less than 12K in length
+* The fundamental unit of deduplication is the extent _reference_, when
-can take 30% of the processing time while saving only 3% of the disk
+it should be the _extent_ itself.  This is an architectural limitation
-space.  There should be an option to just not bother with those, but it's
+that results in excess reads of extent data, even in the Extent scan mode.
-complicated by the btrfs requirement to always dedupe complete extents.
-* There is a lot of duplicate reading of blocks in snapshots.  bees will
-scan all snapshots at close to the same time to try to get better
-performance by caching, but really fixing this requires rewriting the
-crawler to scan the btrfs extent tree directly instead of the subvol
-FS trees.
 * Block reads are currently more allocation- and CPU-intensive than they
 should be, especially for filesystems on SSD where the IO overhead is
@@ -33,8 +26,9 @@ much smaller.  This is a problem for CPU-power-constrained environments
 * bees can currently fragment extents when required to remove duplicate
 blocks, but has no defragmentation capability yet.  When possible, bees
-will attempt to work with existing extent boundaries, but it will not
+will attempt to work with existing extent boundaries and choose the
-aggregate blocks together from multiple extents to create larger ones.
+largest fragments available, but it will not aggregate blocks together
+from multiple extents to create larger ones.
 * When bees fragments an extent, the copied data is compressed.  There
 is currently no way (other than by modifying the source) to select a
@@ -36,6 +36,34 @@
 Has no effect unless `--loadavg-target` is used to specify a target load.
 * `--throttle-factor FACTOR`
 In order to avoid saturating btrfs deferred work queues, bees tracks
 the time that operations with delayed effect (dedupe and tmpfile copy)
 and operations with long run times (`LOGICAL_INO`) run.  If an operation
 finishes before the average run time for that operation, bees will
 sleep for the remainder of the average run time, so that operations
 are submitted to btrfs at a rate similar to the rate that btrfs can
 complete them.
 The `FACTOR` is multiplied by the average run time for each operation
 to calculate the target delay time.
 `FACTOR` 0 is the default, which adds no delays.  bees will attempt
 to saturate btrfs delayed work queues as quickly as possible, which
 may impact other processes on the same filesystem, or even slow down
 bees itself.
 `FACTOR` 1.0 will attempt to keep btrfs delayed work queues filled at
 a steady average rate.
 `FACTOR` more than 1.0 will add delays longer than the average
 run time (e.g. 10.0 will delay all operations that take less than 10x
 the average run time).  High values of `FACTOR` may be desirable when
 using bees with other applications on the same filesystem.
 The maximum delay per operation is 60 seconds.
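The delay rule described for `--throttle-factor` can be sketched as a pure function. This is an illustration of the documented behaviour only: the name `throttle_delay` is hypothetical, and how bees computes the average run time is not shown here.

```cpp
#include <algorithm>

// Maximum delay per operation, as stated in the option description.
constexpr double kMaxDelaySeconds = 60.0;

// Hypothetical helper (not the bees implementation): how long to sleep after
// an operation that took elapsed_seconds, given the average run time for that
// kind of operation and the configured FACTOR.
constexpr double throttle_delay(double elapsed_seconds, double average_seconds, double factor)
{
	// Target time per operation is FACTOR times the average run time.
	const double target = factor * average_seconds;
	// Sleep for the remainder of the target, never a negative time,
	// and never longer than the documented maximum.
	return std::min(kMaxDelaySeconds, std::max(0.0, target - elapsed_seconds));
}
```

With `FACTOR` 0 the target is zero, so no delay is ever added. With `FACTOR` 1.0, an operation that finishes in 0.5 seconds against a 2-second average sleeps for the remaining 1.5 seconds.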
 ## Filesystem tree traversal options
 * `--scan-mode MODE` or `-m`
@@ -47,6 +75,7 @@
  * Mode 1: independent
  * Mode 2: sequential
  * Mode 3: recent
  * Mode 4: extent
 For details of the different scanning modes and the default value of
 this option, see [bees configuration](config.md).
@@ -55,19 +84,22 @@
 * `--workaround-btrfs-send` or `-a`
 _This option is obsolete and should not be used any more._
 Pretend that read-only snapshots are empty and silently discard any
-request to dedupe files referenced through them.  This is a workaround for
+request to dedupe files referenced through them.  This is a workaround
-[problems with the kernel implementation of `btrfs send` and `btrfs send
+for [problems with old kernels running `btrfs send` and `btrfs send
 -p`](btrfs-kernel.md) which make these btrfs features unusable with bees.
- This option should be used to avoid breaking `btrfs send` on the same
+ This option was used to avoid breaking `btrfs send` on old kernels.
-filesystem.
+ The affected kernels are now too old to be recommended for use with bees.
 bees now waits for `btrfs send` to finish.  There is no need for an
 option to enable this.
 **Note:** There is a _significant_ space tradeoff when using this option:
 it is likely no space will be recovered--and possibly significant extra
-space used--until the read-only snapshots are deleted.  On the other
+space used--until the read-only snapshots are deleted.
-hand, if snapshots are rotated frequently then bees will spend less time
-scanning them.
 ## Logging options
@@ -75,9 +75,8 @@ in the shell script that launches `bees`:
        schedtool -D -n20 $$
        ionice -c3 -p $$
-You can also use the [`--loadavg-target` and `--thread-min`
+You can also use the [load management options](options.md) to further
-options](options.md) to further control the impact of bees on the rest
+control the impact of bees on the rest of the system.
-of the system.
 Let the bees fly:
@@ -4,16 +4,13 @@ What to do when something goes wrong with bees
 Hangs and excessive slowness
 ----------------------------
 ### Are you using qgroups or autodefrag?
  Read about [bad btrfs feature interactions](btrfs-other.md).
 ### Use load-throttling options
  If bees is just more aggressive than you would like, consider using
  [load throttling options](options.md).  These are usually more effective
  than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
-  certainly use those too).
+  certainly use those too) because they limit work that bees queues up
+  for later execution inside btrfs.
 ### Check `$BEESSTATUS`
@@ -52,10 +49,6 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
 Thread names of note:
 * `crawl_12345`: scan/dedupe worker threads (the number is the subvol
   ID which the thread is currently working on).  These threads appear
   and disappear from the status dynamically according to the requirements
   of the work queue and loadavg throttling.
 * `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
 * `crawl_master`: task that finds new extents in the filesystem and populates the work queue
 * `crawl_transid`: btrfs transid (generation number) tracker and polling thread
@@ -64,6 +57,13 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
 * `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
 * `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly
 Most other threads have names that are derived from the current dedupe
 task that they are executing:
 * `ref_205ad76b1000_24K_50`:  extent scan performing dedupe of btrfs extent bytenr `205ad76b1000`, which is 24 KiB long and has 50 references
 * `extent_250_32M_16E`:  extent scan searching for extents between 32 MiB + 1 and 16 EiB bytes long, tracking scan position in virtual subvol `250`.
 * `crawl_378_18916`:  subvol scan searching for extent refs in subvol `378`, inode `18916`.
 ### Dump kernel stacks of hung processes
 Check the kernel stacks of all blocked kernel processes:
@@ -91,7 +91,7 @@ bees Crashes
        (gdb) thread apply all bt full
  The last line generates megabytes of output and will often crash gdb.
-  This is OK, submit whatever output gdb can produce.
+  Submit whatever output gdb can produce.
  **Note that this output may include filenames or data from your
  filesystem.**
@@ -160,8 +160,7 @@ Kernel crashes, corruption, and filesystem damage
 -------------------------------------------------
 bees doesn't do anything that _should_ cause corruption or data loss;
-however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
+however, [btrfs has kernel bugs](btrfs-kernel.md), so corruption is
-with some Linux block device layers](btrfs-other.md), so corruption is
 not impossible.
 Issues with the btrfs filesystem kernel code or other block device layers
@@ -49,6 +49,7 @@ namespace crucible {
 		/// @}
 		/// @{ Inode items
 		uint64_t inode_flags() const;
 		uint64_t inode_size() const;
 		/// @}
@@ -64,11 +65,13 @@ namespace crucible {
 		/// @{ Extent items (EXTENT_ITEM)
 		uint64_t extent_begin() const;
 		uint64_t extent_end() const;
 		uint64_t extent_flags() const;
 		uint64_t extent_generation() const;
 		/// @}
 		/// @{ Root items
 		uint64_t root_flags() const;
 		uint64_t root_refs() const;
 		/// @}
 		/// @{ Root backref items.
@@ -108,7 +111,9 @@ namespace crucible {
 		virtual ~BtrfsTreeFetcher() = default;
 		BtrfsTreeFetcher(Fd new_fd);
 		void type(uint8_t type);
 		uint8_t type();
 		void tree(uint64_t tree);
 		uint64_t tree();
 		void transid(uint64_t min_transid, uint64_t max_transid = numeric_limits<uint64_t>::max());
 		/// Block size (sectorsize) of filesystem
 		uint64_t block_size() const;
@@ -169,34 +174,42 @@ namespace crucible {
 		void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
 	};
-	/// Fetch extent items from extent tree
+	/// Fetch extent items from extent tree.
 	/// Does not filter out metadata!  See BtrfsDataExtentTreeFetcher for that.
 	class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsExtentItemFetcher(const Fd &fd);
 	};
-	/// Fetch extent refs from an inode
+	/// Fetch extent refs from an inode.  Caller must set the tree and objectid.
 	class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
 	public:
 		BtrfsExtentDataFetcher(const Fd &fd);
 	};
-	/// Fetch inodes from a subvol
+	/// Fetch raw inode items
 	class BtrfsFsTreeFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsFsTreeFetcher(const Fd &fd, uint64_t subvol);
 	};
 	class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsInodeFetcher(const Fd &fd);
 		BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
 	};
 	/// Fetch a root (subvol) item
 	class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsRootFetcher(const Fd &fd);
 		BtrfsTreeItem root(uint64_t subvol);
 		BtrfsTreeItem root_backref(uint64_t subvol);
 	};
 	/// Fetch data extent items from extent tree, skipping metadata-only block groups
 	class BtrfsDataExtentTreeFetcher : public BtrfsExtentItemFetcher {
 		BtrfsTreeItem		m_current_bg;
 		BtrfsTreeOffsetFetcher	m_chunk_tree;
 	protected:
 		virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr) override;
 	public:
 		BtrfsDataExtentTreeFetcher(const Fd &fd);
 	};
 }
@@ -78,9 +78,6 @@ enum btrfs_compression_type {
 	#define BTRFS_SHARED_BLOCK_REF_KEY      182
 	#define BTRFS_SHARED_DATA_REF_KEY       184
 	#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
-	#define BTRFS_FREE_SPACE_INFO_KEY 198
-	#define BTRFS_FREE_SPACE_EXTENT_KEY 199
-	#define BTRFS_FREE_SPACE_BITMAP_KEY 200
 	#define BTRFS_DEV_EXTENT_KEY    204
 	#define BTRFS_DEV_ITEM_KEY      216
 	#define BTRFS_CHUNK_ITEM_KEY    228
@@ -94,7 +91,35 @@ enum btrfs_compression_type {
 	#define BTRFS_UUID_KEY_SUBVOL   251
 	#define BTRFS_UUID_KEY_RECEIVED_SUBVOL  252
 	#define BTRFS_STRING_ITEM_KEY   253
 #endif
 // BTRFS_INODE_* was added to include/uapi/btrfs_tree.h in v6.2-rc1
 #ifndef BTRFS_INODE_NODATASUM
 	#define BTRFS_INODE_NODATASUM		(1U << 0)
 	#define BTRFS_INODE_NODATACOW		(1U << 1)
 	#define BTRFS_INODE_READONLY		(1U << 2)
 	#define BTRFS_INODE_NOCOMPRESS		(1U << 3)
 	#define BTRFS_INODE_PREALLOC		(1U << 4)
 	#define BTRFS_INODE_SYNC		(1U << 5)
 	#define BTRFS_INODE_IMMUTABLE		(1U << 6)
 	#define BTRFS_INODE_APPEND		(1U << 7)
 	#define BTRFS_INODE_NODUMP		(1U << 8)
 	#define BTRFS_INODE_NOATIME		(1U << 9)
 	#define BTRFS_INODE_DIRSYNC		(1U << 10)
 	#define BTRFS_INODE_COMPRESS		(1U << 11)
 	#define BTRFS_INODE_ROOT_ITEM_INIT	(1U << 31)
 #endif
 #ifndef BTRFS_FREE_SPACE_INFO_KEY
 	#define BTRFS_FREE_SPACE_INFO_KEY 198
 	#define BTRFS_FREE_SPACE_EXTENT_KEY 199
 	#define BTRFS_FREE_SPACE_BITMAP_KEY 200
 	#define BTRFS_FREE_SPACE_OBJECTID -11ULL
 #endif
 #ifndef BTRFS_BLOCK_GROUP_RAID1C4
 	#define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 	#define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
 #endif
 #ifndef BTRFS_DEFRAG_RANGE_START_IO
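The BTRFS_INODE_* fallback definitions in this hunk are plain bit masks, so an inode's `flags` field can be decoded by testing each bit in turn. The following is a minimal sketch of such a decoder. The function name `inode_flags_ntoa_sketch`, the `|` separator, the `"0"` result for an empty mask, and the hex fallback for unknown bits are all assumptions made for illustration; the body of the real `btrfs_inode_flags_ntoa` is not shown in this diff and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Bit values copied from the BTRFS_INODE_* fallback block in this hunk.
// The table and function names are hypothetical, not part of crucible.
struct InodeFlagNameSketch {
	uint64_t bit;
	const char *name;
};

static const InodeFlagNameSketch inode_flag_names_sketch[] = {
	{ 1ULL << 0,  "NODATASUM" },
	{ 1ULL << 1,  "NODATACOW" },
	{ 1ULL << 2,  "READONLY" },
	{ 1ULL << 3,  "NOCOMPRESS" },
	{ 1ULL << 4,  "PREALLOC" },
	{ 1ULL << 5,  "SYNC" },
	{ 1ULL << 6,  "IMMUTABLE" },
	{ 1ULL << 7,  "APPEND" },
	{ 1ULL << 8,  "NODUMP" },
	{ 1ULL << 9,  "NOATIME" },
	{ 1ULL << 10, "DIRSYNC" },
	{ 1ULL << 11, "COMPRESS" },
	{ 1ULL << 31, "ROOT_ITEM_INIT" },
};

// Assumed output format: names joined by '|', leftover bits in hex.
std::string inode_flags_ntoa_sketch(uint64_t flags)
{
	std::string rv;
	for (const auto &f : inode_flag_names_sketch) {
		if (flags & f.bit) {
			if (!rv.empty()) rv += "|";
			rv += f.name;
			// Clear the bit so only unknown bits remain at the end
			flags &= ~f.bit;
		}
	}
	if (flags) {
		std::ostringstream oss;
		oss << "0x" << std::hex << flags;
		if (!rv.empty()) rv += "|";
		rv += oss.str();
	}
	return rv.empty() ? "0" : rv;
}
```

Printing leftover bits in hex keeps the output lossless if a newer kernel sets flags that this table does not know about.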
@@ -55,7 +55,6 @@ namespace crucible {
 		Pointer m_ptr;
 		size_t m_size = 0;
 		mutable mutex m_mutex;
 	friend ostream & operator<<(ostream &os, const ByteVector &bv);
 	};
 	template <class T>
@@ -74,6 +73,8 @@ namespace crucible {
 		THROW_CHECK2(out_of_range, size(), sizeof(T), size() >= sizeof(T));
 		return reinterpret_cast<T*>(data());
 	}
 	ostream& operator<<(ostream &os, const ByteVector &bv);
 }
 #endif // _CRUCIBLE_BYTEVECTOR_H_
@@ -197,11 +197,18 @@ namespace crucible {
 		size_t m_buf_size;
 		set<BtrfsIoctlSearchHeader> m_result;
 		static thread_local size_t s_calls;
 		static thread_local size_t s_loops;
 		static thread_local size_t s_loops_empty;
 		static thread_local shared_ptr<ostream> s_debug_ostream;
 	};
 	ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
 	ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);
 	string btrfs_chunk_type_ntoa(uint64_t type);
 	string btrfs_inode_flags_ntoa(uint64_t inode_flags);
 	string btrfs_search_type_ntoa(unsigned type);
 	string btrfs_search_objectid_ntoa(uint64_t objectid);
 	string btrfs_compress_type_ntoa(uint8_t type);
@@ -239,14 +246,14 @@ namespace crucible {
 		unsigned long available() const;
 	};
 	template<class V> ostream &hexdump(ostream &os, const V &v);
 	struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args_v3 {
 		BtrfsIoctlFsInfoArgs();
 		void do_ioctl(int fd);
 		bool do_ioctl_nothrow(int fd);
 		uint16_t csum_type() const;
 		uint16_t csum_size() const;
 		uint64_t generation() const;
 		vector<uint8_t> fsid() const;
 	};
 	ostream & operator<<(ostream &os, const BtrfsIoctlFsInfoArgs &a);
@@ -12,12 +12,14 @@ namespace crucible {
 	ostream &
 	hexdump(ostream &os, const V &v)
 	{
-		os << "V { size = " << v.size() << ", data:\n";
-		for (size_t i = 0; i < v.size(); i += 8) {
+		const auto v_size = v.size();
+		const uint8_t* const v_data = reinterpret_cast<const uint8_t*>(v.data());
+		os << "V { size = " << v_size << ", data:\n";
+		for (size_t i = 0; i < v_size; i += 8) {
 			string hex, ascii;
 			for (size_t j = i; j < i + 8; ++j) {
-				if (j < v.size()) {
-					uint8_t c = v[j];
+				if (j < v_size) {
+					const uint8_t c = v_data[j];
 					char buf[8];
 					sprintf(buf, "%02x ", c);
 					hex += buf;
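The hexdump change in this hunk reads `v.size()` once and walks the bytes through a `uint8_t` pointer taken from `v.data()`, instead of calling `v.size()` and `v[j]` on every iteration. A minimal self-contained sketch of that pattern follows. The name `hexdump_sketch`, the string return type, and the exact line layout are assumptions for illustration and are not the crucible implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the hexdump() template touched by this hunk.
// Works with any container that has size() and data() over byte-sized items.
template <class V>
std::string hexdump_sketch(const V &v)
{
	// Snapshot the size and the data pointer once, up front
	const auto v_size = v.size();
	const uint8_t *const v_data = reinterpret_cast<const uint8_t *>(v.data());
	assert(v_size == 0 || v_data != nullptr);
	std::ostringstream os;
	for (size_t i = 0; i < v_size; i += 8) {
		std::string hex, ascii;
		for (size_t j = i; j < i + 8 && j < v_size; ++j) {
			const uint8_t c = v_data[j];
			char buf[8];
			std::snprintf(buf, sizeof(buf), "%02x ", c);
			hex += buf;
			// Show printable ASCII as-is and everything else as '.'
			ascii += (c >= 32 && c < 127) ? static_cast<char>(c) : '.';
		}
		os << hex << ascii << "\n";
	}
	return os.str();
}
```

Going through `data()` also lets the same template accept containers whose `operator[]` does extra work, such as bounds checks or locking, though whether that was the motivation here is not stated in the diff.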
@@ -117,7 +117,7 @@ namespace crucible {
 		while (full() || locked(name)) {
 			m_condvar.wait(lock);
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK0(runtime_error, rv.second);
 	}
@@ -129,7 +129,7 @@ namespace crucible {
 		if (full() || locked(name)) {
 			return false;
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK1(runtime_error, name, rv.second);
 		return true;
 	}
@@ -14,6 +14,7 @@ namespace crucible {
 		mutex m_mutex;
 		condition_variable m_cv;
 		map<string, size_t> m_counters;
 		bool m_do_locking = true;
 		class LockHandle {
 			const string m_type;
@@ -33,6 +34,7 @@ namespace crucible {
 		shared_ptr<LockHandle> get_lock_private(const string &type);
 	public:
 		static shared_ptr<LockHandle> get_lock(const string &type);
 		static void enable_locking(bool enabled);
 	};
 }
@@ -0,0 +1,52 @@
 #ifndef CRUCIBLE_OPENAT2_H
 #define CRUCIBLE_OPENAT2_H
 #include <cstdlib>
 // Compatibility for building on old libc for new kernel
 #include <linux/version.h>
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
 #include <linux/openat2.h>
 #else
 #include <linux/types.h>
 #ifndef RESOLVE_NO_XDEV
 #define RESOLVE_NO_XDEV 1
 // RESOLVE_NO_XDEV was there from the beginning of openat2,
 // so if that's missing, so is open_how
 struct open_how {
 	__u64 flags;
 	__u64 mode;
 	__u64 resolve;
 };
 #endif
 #ifndef RESOLVE_NO_MAGICLINKS
 #define RESOLVE_NO_MAGICLINKS 2
 #endif
 #ifndef RESOLVE_NO_SYMLINKS
 #define RESOLVE_NO_SYMLINKS 4
 #endif
 #ifndef RESOLVE_BENEATH
 #define RESOLVE_BENEATH 8
 #endif
 #ifndef RESOLVE_IN_ROOT
 #define RESOLVE_IN_ROOT 16
 #endif
 #endif // Linux version >= v5.6
 extern "C" {
 /// Weak symbol to support libc with no syscall wrapper
 int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size) throw();
 };
 #endif // CRUCIBLE_OPENAT2_H
@@ -10,6 +10,10 @@
 #include <sys/wait.h>
 #include <unistd.h>
 extern "C" {
 	pid_t gettid() throw();
 };
 namespace crucible {
 	using namespace std;
@@ -73,7 +77,6 @@ namespace crucible {
 	typedef ResourceHandle<Process::id, Process> Pid;
-	pid_t gettid();
 	double getloadavg1();
 	double getloadavg5();
 	double getloadavg15();
@@ -6,23 +6,23 @@
 #include <algorithm>
 #include <limits>
-#include <cstdint>
+// Debug stream
-
+#include <memory>
 #if 1
 #include <iostream>
 #include <sstream>
-#define DINIT(__x) __x
+
-#define DLOG(__x) do { logs << __x << std::endl; } while (false)
+#include <cstdint>
 #define DOUT(__err) do { __err << logs.str(); } while (false)
 #else
 #define DINIT(__x) do {} while (false)
 #define DLOG(__x) do {} while (false)
 #define DOUT(__x) do {} while (false)
 #endif
 namespace crucible {
 	using namespace std;
 	extern thread_local shared_ptr<ostream> tl_seeker_debug_str;
 	#define SEEKER_DEBUG_LOG(__x) do { \
 		if (tl_seeker_debug_str) { \
 			(*tl_seeker_debug_str) << __x << "\n"; \
 		} \
 	} while (false)
 	// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
 	// - fetches objects in Pos order, starting from lower (must be >= lower)
 	// - must return upper if present, may or may not return objects after that
@@ -49,113 +49,108 @@ namespace crucible {
 	Pos
 	seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
 	{
-		DINIT(ostringstream logs);
+		static const Pos end_pos = numeric_limits<Pos>::max();
-		try {
+		// TBH this probably won't work if begin_pos != 0, i.e. any signed type
-			static const Pos end_pos = numeric_limits<Pos>::max();
+		static const Pos begin_pos = numeric_limits<Pos>::min();
-			// TBH this probably won't work if begin_pos != 0, i.e. any signed type
+		// Run a binary search looking for the highest key below target_pos.
-			static const Pos begin_pos = numeric_limits<Pos>::min();
+		// Initial upper bound of the search is target_pos.
-			// Run a binary search looking for the highest key below target_pos.
+		// Find initial lower bound by doubling the size of the range until a key below target_pos
-			// Initial upper bound of the search is target_pos.
+		// is found, or the lower bound reaches the beginning of the search space.
-			// Find initial lower bound by doubling the size of the range until a key below target_pos
+		// If the lower bound search reaches the beginning of the search space without finding a key,
-			// is found, or the lower bound reaches the beginning of the search space.
+		// return the beginning of the search space; otherwise, perform a binary search between
-			// If the lower bound search reaches the beginning of the search space without finding a key,
+		// the bounds now established.
-			// return the beginning of the search space; otherwise, perform a binary search between
+		Pos lower_bound = 0;
-			// the bounds now established.
+		Pos upper_bound = target_pos;
-			Pos lower_bound = 0;
+		bool found_low = false;
-			Pos upper_bound = target_pos;
+		Pos probe_pos = target_pos;
-			bool found_low = false;
+		// We need one loop for each bit of the search space to find the lower bound,
-			Pos probe_pos = target_pos;
+		// one loop for each bit of the search space to find the upper bound,
-			// We need one loop for each bit of the search space to find the lower bound,
+		// and one extra loop to confirm the boundary is correct.
-			// one loop for each bit of the search space to find the upper bound,
+		for (size_t loop_count = min((1 + numeric_limits<Pos>::digits) * size_t(2), max_loops); loop_count; --loop_count) {
-			// and one extra loop to confirm the boundary is correct.
+			SEEKER_DEBUG_LOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
-			for (size_t loop_count = min(numeric_limits<Pos>::digits * size_t(2) + 1, max_loops); loop_count; --loop_count) {
+			auto result = fetch(probe_pos, target_pos);
-				DLOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
+			const Pos low_pos = result.empty() ? end_pos : *result.begin();
-				auto result = fetch(probe_pos, target_pos);
+			const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
-				const Pos low_pos = result.empty() ? end_pos : *result.begin();
+			SEEKER_DEBUG_LOG(" = " << low_pos << ".." << high_pos);
-				const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
+			// check for correct behavior of the fetch function
-				DLOG(" = " << low_pos << ".." << high_pos);
+			THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
-				// check for correct behavior of the fetch function
+			THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
-				THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
+			THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
-				THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
+			if (!found_low) {
-				THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
+				// if target_pos == end_pos then we will find it in every empty result set,
-				if (!found_low) {
+				// so in that case we force the lower bound to be lower than end_pos
-					// if target_pos == end_pos then we will find it in every empty result set,
+				if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
-					// so in that case we force the lower bound to be lower than end_pos
+					// found a lower bound, set the low bound there and switch to binary search
-					if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
+					found_low = true;
-						// found a lower bound, set the low bound there and switch to binary search
+					lower_bound = low_pos;
-						found_low = true;
+					SEEKER_DEBUG_LOG("found_low = true, lower_bound = " << lower_bound);
-						lower_bound = low_pos;
+				} else {
-						DLOG("found_low = true, lower_bound = " << lower_bound);
+					// still looking for lower bound
-					} else {
+					// if probe_pos was begin_pos then we can stop with no result
-						// still looking for lower bound
+					if (probe_pos == begin_pos) {
-						// if probe_pos was begin_pos then we can stop with no result
+						SEEKER_DEBUG_LOG("return: probe_pos == begin_pos " << begin_pos);
-						if (probe_pos == begin_pos) {
+						return begin_pos;
-							DLOG("return: probe_pos == begin_pos " << begin_pos);
-							return begin_pos;
-						}
-						// double the range size, or use the distance between objects found so far
-						THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
-						// already checked low_pos <= high_pos above
-						const Pos want_delta = max(upper_bound - probe_pos, min_step);
-						// avoid underflowing the beginning of the search space
-						const Pos have_delta = min(want_delta, probe_pos - begin_pos);
-						THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
-						// move probe and try again
-						probe_pos = probe_pos - have_delta;
-						DLOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
-						continue;
 					}
+					// double the range size, or use the distance between objects found so far
+					THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+					// already checked low_pos <= high_pos above
+					const Pos want_delta = max(upper_bound - probe_pos, min_step);
+					// avoid underflowing the beginning of the search space
+					const Pos have_delta = min(want_delta, probe_pos - begin_pos);
+					THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
+					// move probe and try again
+					probe_pos = probe_pos - have_delta;
+					SEEKER_DEBUG_LOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
+					continue;
 				}
-				if (low_pos <= target_pos && target_pos <= high_pos) {
-					// have keys on either side of target_pos in result
-					// search from the high end until we find the highest key below target
-					for (auto i = result.rbegin(); i != result.rend(); ++i) {
-						// more correctness checking for fetch
-						THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
-						if (*i <= target_pos) {
-							DLOG("return: *i " << *i << " <= target_pos " << target_pos);
-							return *i;
-						}
-					}
-					// if the list is empty then low_pos = high_pos = end_pos
-					// if target_pos = end_pos also, then we will execute the loop
-					// above but not find any matching entries.
-					THROW_CHECK0(runtime_error, result.empty());
-				}
-				if (target_pos <= low_pos) {
-					// results are all too high, so probe_pos..low_pos is too high
-					// lower the high bound to the probe pos
-					upper_bound = probe_pos;
-					DLOG("upper_bound = probe_pos " << probe_pos);
-				}
-				if (high_pos < target_pos) {
-					// results are all too low, so probe_pos..high_pos is too low
-					// raise the low bound to the high_pos
-					DLOG("lower_bound = high_pos " << high_pos);
-					lower_bound = high_pos;
-				}
-				// compute a new probe pos at the middle of the range and try again
-				// we can't have a zero-size range here because we would not have set found_low yet
-				THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
-				const Pos delta = (upper_bound - lower_bound) / 2;
-				probe_pos = lower_bound + delta;
-				if (delta < 1) {
-					// nothing can exist in the range (lower_bound, upper_bound)
-					// and an object is known to exist at lower_bound
-					DLOG("return: probe_pos == lower_bound " << lower_bound);
-					return lower_bound;
-				}
-				THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
-				THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
-				DLOG("loop: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
 			}
-			THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
+			if (low_pos <= target_pos && target_pos <= high_pos) {
-				"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
+				// have keys on either side of target_pos in result
-				"found_low " << found_low);
+				// search from the high end until we find the highest key below target
-		} catch (...) {
+				for (auto i = result.rbegin(); i != result.rend(); ++i) {
-			DOUT(cerr);
+					// more correctness checking for fetch
-			throw;
+					THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
+					if (*i <= target_pos) {
+						SEEKER_DEBUG_LOG("return: *i " << *i << " <= target_pos " << target_pos);
+						return *i;
+					}
+				}
+				// if the list is empty then low_pos = high_pos = end_pos
+				// if target_pos = end_pos also, then we will execute the loop
+				// above but not find any matching entries.
+				THROW_CHECK0(runtime_error, result.empty());
+			}
+			if (target_pos <= low_pos) {
+				// results are all too high, so probe_pos..low_pos is too high
+				// lower the high bound to the probe pos, low_pos cannot be lower
+				SEEKER_DEBUG_LOG("upper_bound = probe_pos " << probe_pos);
+				upper_bound = probe_pos;
+			}
+			if (high_pos < target_pos) {
+				// results are all too low, so probe_pos..high_pos is too low
+				// raise the low bound to high_pos but not above upper_bound
+				const auto next_pos = min(high_pos, upper_bound);
+				SEEKER_DEBUG_LOG("lower_bound = next_pos " << next_pos);
+				lower_bound = next_pos;
+			}
+			// compute a new probe pos at the middle of the range and try again
+			// we can't have a zero-size range here because we would not have set found_low yet
+			THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
+			const Pos delta = (upper_bound - lower_bound) / 2;
+			probe_pos = lower_bound + delta;
+			if (delta < 1) {
+				// nothing can exist in the range (lower_bound, upper_bound)
+				// and an object is known to exist at lower_bound
+				SEEKER_DEBUG_LOG("return: probe_pos == lower_bound " << lower_bound);
+				return lower_bound;
+			}
+			THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
+			THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+			SEEKER_DEBUG_LOG("loop bottom: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
 		}
+		THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
+			"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
+			"found_low " << found_low);
 	}
 }
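The comments in seek_backward describe a two-phase search: step the probe backward from target_pos in doubling strides until a fetch returns a key at or below the target, then binary-search between that key and the lowest probe known to be empty. The sketch below illustrates that idea in a simplified, self-contained form. It is not the crucible code: `seek_backward_sketch`, `make_fetch_sketch`, the single-argument fetch callback, the fixed `uint64_t` position type, and the "return 0 when nothing is found" convention are all assumptions chosen to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <set>
#include <vector>

using KeysSketch = std::vector<uint64_t>;
// Assumed contract: returns keys >= lower in ascending order, starting at the
// lowest such key, possibly truncated (like a search ioctl with nr_items set).
using FetchSketch = std::function<KeysSketch(uint64_t lower)>;

// Hypothetical helper: serve at most `limit` keys per call from a std::set
FetchSketch make_fetch_sketch(std::set<uint64_t> keys, size_t limit)
{
	return [keys, limit](uint64_t lower) {
		KeysSketch rv;
		for (auto it = keys.lower_bound(lower); it != keys.end() && rv.size() < limit; ++it) {
			rv.push_back(*it);
		}
		return rv;
	};
}

// Highest key <= target, or 0 if no such key exists (assumed convention)
uint64_t seek_backward_sketch(uint64_t target, const FetchSketch &fetch, uint64_t min_step = 1)
{
	assert(min_step > 0);
	uint64_t probe = target, step = min_step;
	uint64_t lo = 0, hi = target;	// hi: lowest probe with no key in [hi, target]
	// Phase 1: move the probe backward in doubling steps
	for (;;) {
		const KeysSketch r = fetch(probe);
		if (!r.empty() && r.front() <= target) {
			uint64_t k = r.front();
			for (const auto key : r) if (key <= target) k = key;
			// Exact hit, or the result straddles target: k is the answer
			if (k == target || r.back() > target) return k;
			lo = r.back();
			break;
		}
		hi = probe;
		if (probe == 0) return 0;
		probe -= std::min(step, probe);	// avoid underflow at the beginning
		if (step <= std::numeric_limits<uint64_t>::max() / 2) step *= 2;
	}
	// Phase 2: a key exists at lo and none in [hi, target]; bisect the gap
	while (hi - lo > 1) {
		const uint64_t mid = lo + (hi - lo) / 2;
		const KeysSketch r = fetch(mid);
		if (r.empty() || r.front() > target) { hi = mid; continue; }
		uint64_t k = r.front();
		for (const auto key : r) if (key <= target) k = key;
		if (r.back() > target) return k;
		lo = r.back();
	}
	return lo;
}
```

As a usage example under the same assumptions, `seek_backward_sketch(35, make_fetch_sketch({10, 20, 30, 40, 50, 1000}, 2))` probes 35, 34, 32 and 28, and the fetch at 28 returns 30 and 40, which straddle the target, so 30 is returned. A truncating fetch is the reason the bisection phase is needed: a non-empty result does not by itself prove that the highest key below the target has been seen.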
@@ -0,0 +1,106 @@
 #ifndef CRUCIBLE_TABLE_H
 #define CRUCIBLE_TABLE_H
 #include <functional>
 #include <limits>
 #include <map>
 #include <memory>
 #include <ostream>
 #include <sstream>
 #include <string>
 #include <vector>
 namespace crucible {
 	namespace Table {
 		using namespace std;
 		using Content = function<string(size_t width, size_t height)>;
 		const size_t endpos = numeric_limits<size_t>::max();
 		Content Fill(const char c);
 		Content Text(const string& s);
 		template <class T>
 		Content Number(const T& num)
 		{
 			ostringstream oss;
 			oss << num;
 			return Text(oss.str());
 		}
 		class Cell {
 			Content m_content;
 		public:
 			Cell(const Content &fn = [](size_t, size_t) { return string(); } );
 			Cell& operator=(const Content &fn);
 			string text(size_t width, size_t height) const;
 		};
 		class Dimension {
 			size_t m_next_pos = 0;
 			vector<size_t> m_elements;
 		friend class Table;
 			size_t at(size_t) const;
 		public:
 			size_t size() const;
 			size_t insert(size_t pos);
 			void erase(size_t pos);
 		};
 		class Table {
 			Dimension m_rows, m_cols;
 			map<pair<size_t, size_t>, Cell> m_cells;
 			string m_left = "|";
 			string m_mid = "|";
 			string m_right = "|";
 		public:
 			Dimension &rows();
 			const Dimension& rows() const;
 			Dimension &cols();
 			const Dimension& cols() const;
 			Cell& at(size_t row, size_t col);
 			const Cell& at(size_t row, size_t col) const;
 			template <class T> void insert_row(size_t pos, const T& container);
 			template <class T> void insert_col(size_t pos, const T& container);
 			void left(const string &s);
 			void mid(const string &s);
 			void right(const string &s);
 			const string& left() const;
 			const string& mid() const;
 			const string& right() const;
 		};
 		ostream& operator<<(ostream &os, const Table &table);
 		template <class T>
 		void
 		Table::insert_row(size_t pos, const T& container)
 		{
 			const auto new_pos = m_rows.insert(pos);
 			size_t col = 0;
 			for (const auto &i : container) {
 				if (col >= cols().size()) {
 					cols().insert(col);
 				}
 				at(new_pos, col++) = i;
 			}
 		}
 		template <class T>
 		void
 		Table::insert_col(size_t pos, const T& container)
 		{
 			const auto new_pos = m_cols.insert(pos);
 			size_t row = 0;
 			for (const auto &i : container) {
 				if (row >= rows().size()) {
 					rows().insert(row);
 				}
 				at(row++, new_pos) = i;
 			}
 		}
 	}
 }
 #endif // CRUCIBLE_TABLE_H
@@ -40,10 +40,17 @@ namespace crucible {
 		/// after the current instance exits.
 		void run() const;
 		/// Schedule task to run when no other Task is available.
 		void idle() const;
 		/// Schedule Task to run after this Task has run or
 		/// been destroyed.
 		void append(const Task &task) const;
 		/// Schedule Task to run after this Task has run or
 		/// been destroyed, in Task ID order.
 		void insert(const Task &task) const;
 		/// Describe Task as text.
 		string title() const;
@@ -163,15 +170,12 @@ namespace crucible {
 		/// (it is the ExclusionLock that owns the lock, so it can
 		/// be passed to other Tasks or threads, but this is not
 		/// recommended practice).
-		/// If not successful, current Task is appended to the
+		/// If not successful, the argument Task is appended to the
 		/// task that currently holds the lock.  Current task is
-		/// expected to release any other ExclusionLock
+		/// expected to immediately release any other ExclusionLock
 		/// objects it holds, and exit its Task function.
 		ExclusionLock try_lock(const Task &task);
 		/// Execute Task when Exclusion is unlocked (possibly
 		/// immediately).
 		void insert_task(const Task &t);
 	};
 	/// Wrapper around pthread_setname_np which handles length limits
@@ -34,7 +34,7 @@ namespace crucible {
 		double	m_rate;
 		double	m_burst;
 		double  m_tokens = 0.0;
-		mutex	m_mutex;
+		mutable mutex m_mutex;
 		void update_tokens();
 		RateLimiter() = delete;
@@ -45,6 +45,8 @@ namespace crucible {
 		double sleep_time(double cost = 1.0);
 		bool is_ready();
 		void borrow(double cost = 1.0);
 		void rate(double new_rate);
 		double rate() const;
 	};
 	class RateEstimator {
@@ -88,6 +90,9 @@ namespace crucible {
 		// Read count
 		uint64_t count() const;
 		/// Increment count (like update(count() + more), but atomic)
 		void increment(uint64_t more = 1);
 		// Convert counts to chrono types
 		chrono::high_resolution_clock::time_point time_point(uint64_t absolute_count) const;
 		chrono::duration<double> duration(uint64_t relative_count) const;
@@ -14,9 +14,12 @@ CRUCIBLE_OBJS = \
 	fs.o \
 	multilock.o \
 	ntoa.o \
 	openat2.o \
 	path.o \
 	process.o \
 	seeker.o \
 	string.o \
 	table.o \
 	task.o \
 	time.o \
 	uname.o \
@@ -5,6 +5,12 @@
 #include "crucible/hexdump.h"
 #include "crucible/seeker.h"
 #define CRUCIBLE_BTRFS_TREE_DEBUG(x) do { \
 	if (BtrfsIoctlSearchKey::s_debug_ostream) { \
 		(*BtrfsIoctlSearchKey::s_debug_ostream) << x; \
 	} \
 } while (false)
 namespace crucible {
 	using namespace std;
@@ -22,6 +28,13 @@ namespace crucible {
 		return m_objectid + m_offset;
 	}
 	uint64_t
 	BtrfsTreeItem::extent_flags() const
 	{
 		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
 		return btrfs_get_member(&btrfs_extent_item::flags, m_data);
 	}
 	uint64_t
 	BtrfsTreeItem::extent_generation() const
 	{
@@ -61,6 +74,13 @@ namespace crucible {
 		return btrfs_get_member(&btrfs_root_item::flags, m_data);
 	}
 	uint64_t
 	BtrfsTreeItem::root_refs() const
 	{
 		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_ITEM_KEY);
 		return btrfs_get_member(&btrfs_root_item::refs, m_data);
 	}
 	ostream &
 	operator<<(ostream &os, const BtrfsTreeItem &bti)
 	{
@@ -137,6 +157,13 @@ namespace crucible {
 		return btrfs_get_member(&btrfs_inode_item::size, m_data);
 	}
 	uint64_t
 	BtrfsTreeItem::inode_flags() const
 	{
 		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_INODE_ITEM_KEY);
 		return btrfs_get_member(&btrfs_inode_item::flags, m_data);
 	}
 	uint64_t
 	BtrfsTreeItem::file_extent_logical_bytes() const
 	{
@@ -269,12 +296,24 @@ namespace crucible {
 		m_type = type;
 	}
 	uint8_t
 	BtrfsTreeFetcher::type()
 	{
 		return m_type;
 	}
 	void
 	BtrfsTreeFetcher::tree(uint64_t tree)
 	{
 		m_tree = tree;
 	}
 	uint64_t
 	BtrfsTreeFetcher::tree()
 	{
 		return m_tree;
 	}
 	void
 	BtrfsTreeFetcher::transid(uint64_t min_transid, uint64_t max_transid)
 	{
@@ -329,6 +368,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::at(uint64_t logical)
 	{
 		CRUCIBLE_BTRFS_TREE_DEBUG("at " << logical);
 		BtrfsIoctlSearchKey &sk = m_sk;
 		fill_sk(sk, logical);
 		// Exact match, should return 0 or 1 items
@@ -371,53 +411,59 @@ namespace crucible {
 	BtrfsTreeFetcher::rlower_bound(uint64_t logical)
 	{
 	#if 0
-	#define BTFRLB_DEBUG(x) do { cerr << x; } while (false)
+		static bool btfrlb_debug = getenv("BTFLRB_DEBUG");
+	#define BTFRLB_DEBUG(x) do { if (btfrlb_debug) cerr << x; } while (false)
 	#else
-	#define BTFRLB_DEBUG(x) do { } while (false)
+	#define BTFRLB_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
 	#endif
 		BtrfsTreeItem closest_item;
 		uint64_t closest_logical = 0;
 		BtrfsIoctlSearchKey &sk = m_sk;
 		size_t loops = 0;
-		BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << endl);
+		BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << " in tree " << tree() << endl);
-		seek_backward(scale_logical(logical), [&](uint64_t lower_bound, uint64_t upper_bound) {
+		seek_backward(scale_logical(logical), [&](uint64_t const lower_bound, uint64_t const upper_bound) {
 			++loops;
 			fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
 			set<uint64_t> rv;
+			bool too_far = false;
 			do {
 				sk.nr_items = 4;
 				sk.do_ioctl(fd());
 				BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
 				for (auto &i : sk.m_result) {
 					next_sk(sk, i);
-					const auto this_logical = hdr_logical(i);
+					// If hdr_stop or !hdr_match, don't inspect the item
-					const auto scaled_hdr_logical = scale_logical(this_logical);
+					if (hdr_stop(i)) {
-					BTFRLB_DEBUG(" " << to_hex(scaled_hdr_logical));
+						too_far = true;
-					if (hdr_match(i)) {
+						rv.insert(numeric_limits<uint64_t>::max());
-						if (this_logical <= logical && this_logical > closest_logical) {
+						BTFRLB_DEBUG("(stop)");
-							closest_logical = this_logical;
-							closest_item = i;
-						}
-						BTFRLB_DEBUG("(match)");
-						rv.insert(scaled_hdr_logical);
-					}
-					if (scaled_hdr_logical > upper_bound || hdr_stop(i)) {
-						if (scaled_hdr_logical >= upper_bound) {
-							BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
-						}
-						if (hdr_stop(i)) {
-							rv.insert(numeric_limits<uint64_t>::max());
-							BTFRLB_DEBUG("(stop)");
-						}
 						break;
-					} else {
-						BTFRLB_DEBUG("(cont'd)");
 					}
+					if (!hdr_match(i)) {
+						BTFRLB_DEBUG("(no match)");
+						continue;
+					}
+					const auto this_logical = hdr_logical(i);
+					BTFRLB_DEBUG(" " << to_hex(this_logical) << " " << i);
+					const auto scaled_hdr_logical = scale_logical(this_logical);
+					BTFRLB_DEBUG(" " << "(match)");
+					if (scaled_hdr_logical > upper_bound) {
+						too_far = true;
+						BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
+						break;
+					}
+					if (this_logical <= logical && this_logical > closest_logical) {
+						closest_logical = this_logical;
+						closest_item = i;
+						BTFRLB_DEBUG("(closest)");
+					}
+					rv.insert(scaled_hdr_logical);
+					BTFRLB_DEBUG("(cont'd)");
 				}
 				BTFRLB_DEBUG(endl);
 				// We might get a search result that contains only non-matching items.
 				// Keep looping until we find any matching item or we run out of tree.
-			} while (rv.empty() && !sk.m_result.empty());
+			} while (!too_far && rv.empty() && !sk.m_result.empty());
 			return rv;
 		}, scale_logical(lookbehind_size()));
 		return closest_item;
@@ -448,6 +494,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::next(uint64_t logical)
 	{
 		CRUCIBLE_BTRFS_TREE_DEBUG("next " << logical);
 		const auto scaled_logical = scale_logical(logical);
 		if (scaled_logical + 1 > scaled_max_logical()) {
 			return BtrfsTreeItem();
@@ -458,6 +505,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::prev(uint64_t logical)
 	{
 		CRUCIBLE_BTRFS_TREE_DEBUG("prev " << logical);
 		const auto scaled_logical = scale_logical(logical);
 		if (scaled_logical < 1) {
 			return BtrfsTreeItem();
@@ -542,9 +590,10 @@ namespace crucible {
 	BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
 	{
 	#if 0
-	#define BCTFGS_DEBUG(x) do { cerr << x; } while (false)
+		static bool bctfgs_debug = getenv("BCTFGS_DEBUG");
+	#define BCTFGS_DEBUG(x) do { if (bctfgs_debug) cerr << x; } while (false)
 	#else
-	#define BCTFGS_DEBUG(x) do { } while (false)
+	#define BCTFGS_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
 	#endif
 		const uint64_t logical_end = logical + count * block_size();
 		BtrfsTreeItem bti = rlower_bound(logical);
@@ -636,14 +685,6 @@ namespace crucible {
 		type(BTRFS_EXTENT_DATA_KEY);
 	}
-	BtrfsFsTreeFetcher::BtrfsFsTreeFetcher(const Fd &new_fd, uint64_t subvol) :
-		BtrfsTreeObjectFetcher(new_fd)
-	{
-		tree(subvol);
-		type(BTRFS_EXTENT_DATA_KEY);
-		scale_size(1);
-	}
 	BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
 		BtrfsTreeObjectFetcher(fd)
 	{
@@ -667,18 +708,86 @@ namespace crucible {
 		BtrfsTreeObjectFetcher(fd)
 	{
 		tree(BTRFS_ROOT_TREE_OBJECTID);
 		type(BTRFS_ROOT_ITEM_KEY);
 		scale_size(1);
 	}
 	BtrfsTreeItem
-	BtrfsRootFetcher::root(uint64_t subvol)
+	BtrfsRootFetcher::root(const uint64_t subvol)
 	{
 		const auto my_type = BTRFS_ROOT_ITEM_KEY;
 		type(my_type);
 		const auto item = at(subvol);
 		if (!!item) {
 			THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
-			THROW_CHECK2(runtime_error, item.type(), BTRFS_ROOT_ITEM_KEY, item.type() == BTRFS_ROOT_ITEM_KEY);
+			THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
 		}
 		return item;
 	}
 	BtrfsTreeItem
 	BtrfsRootFetcher::root_backref(const uint64_t subvol)
 	{
 		const auto my_type = BTRFS_ROOT_BACKREF_KEY;
 		type(my_type);
 		const auto item = at(subvol);
 		if (!!item) {
 			THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
 			THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
 		}
 		return item;
 	}
 	BtrfsDataExtentTreeFetcher::BtrfsDataExtentTreeFetcher(const Fd &fd) :
 		BtrfsExtentItemFetcher(fd),
 		m_chunk_tree(fd)
 	{
 		tree(BTRFS_EXTENT_TREE_OBJECTID);
 		type(BTRFS_EXTENT_ITEM_KEY);
 		m_chunk_tree.tree(BTRFS_CHUNK_TREE_OBJECTID);
 		m_chunk_tree.type(BTRFS_CHUNK_ITEM_KEY);
 		m_chunk_tree.objectid(BTRFS_FIRST_CHUNK_TREE_OBJECTID);
 	}
 	void
 	BtrfsDataExtentTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
 	{
 		key.min_type = key.max_type = type();
 		key.max_objectid = key.max_offset = numeric_limits<uint64_t>::max();
 		key.min_offset = 0;
 		key.min_objectid = hdr.objectid;
 		const auto step = scale_size();
 		if (key.min_objectid < numeric_limits<uint64_t>::max() - step) {
 			key.min_objectid += step;
 		} else {
 			key.min_objectid = numeric_limits<uint64_t>::max();
 		}
 		// If we're still in our current block group, check here
 		if (!!m_current_bg) {
 			const auto bg_begin = m_current_bg.offset();
 			const auto bg_end = bg_begin + m_current_bg.chunk_length();
 			// If we are still in our current block group, return early
 			if (key.min_objectid >= bg_begin && key.min_objectid < bg_end) return;
 		}
 		// We don't have a current block group or we're out of range
 		// Find the chunk that this bytenr belongs to
 		m_current_bg = m_chunk_tree.rlower_bound(key.min_objectid);
 		// Make sure it's a data block group
 		while (!!m_current_bg) {
 			// Data block group, stop here
 			if (m_current_bg.chunk_type() & BTRFS_BLOCK_GROUP_DATA) break;
 			// Not a data block group, skip to end
 			key.min_objectid = m_current_bg.offset() + m_current_bg.chunk_length();
 			m_current_bg = m_chunk_tree.lower_bound(key.min_objectid);
 		}
 		if (!m_current_bg) {
 			// Ran out of data block groups, stop here
 			return;
 		}
 		// Check to see if bytenr is in the current data block group
 		const auto bg_begin = m_current_bg.offset();
 		if (key.min_objectid < bg_begin) {
 			// Move forward to start of data block group
 			key.min_objectid = bg_begin;
 		}
 	}
 }
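Review note: the objectid step at the top of `next_sk` clamps at the maximum 64-bit value instead of wrapping around. Below is a minimal standalone sketch of that clamp. The name `saturating_advance` is hypothetical and does not exist in crucible; it only isolates the arithmetic.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of crucible: advance pos by step,
// clamping at the largest uint64_t instead of wrapping around.
// Mirrors the if/else on key.min_objectid in next_sk above.
constexpr std::uint64_t
saturating_advance(const std::uint64_t pos, const std::uint64_t step)
{
	return (pos < std::numeric_limits<std::uint64_t>::max() - step)
		? pos + step
		: std::numeric_limits<std::uint64_t>::max();
}

// Compile-time checks of the two branches
static_assert(saturating_advance(4096, 4096) == 8192, "ordinary step");
static_assert(saturating_advance(std::numeric_limits<std::uint64_t>::max() - 1, 4096)
	== std::numeric_limits<std::uint64_t>::max(), "clamped step");
```

Comparing against `max() - step` before adding avoids ever computing a sum that overflows.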
@@ -44,10 +44,10 @@ namespace crucible {
 	}
 	ByteVector::value_type&
-	ByteVector::operator[](size_t size) const
+	ByteVector::operator[](size_t index) const
 	{
 		unique_lock<mutex> lock(m_mutex);
-		return m_ptr.get()[size];
+		return m_ptr.get()[index];
 	}
 	ByteVector::ByteVector(const ByteVector &that)
@@ -183,7 +183,6 @@ namespace crucible {
 	ostream&
 	operator<<(ostream &os, const ByteVector &bv) {
 		unique_lock<mutex> lock(bv.m_mutex);
 		hexdump(os, bv);
 		return os;
 	}
@@ -76,7 +76,7 @@ namespace crucible {
 			DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));
 			header_stream << buf;
-			header_stream << " " << getpid() << "." << crucible::gettid();
+			header_stream << " " << getpid() << "." << gettid();
 			if (add_prefix_level) {
 				header_stream << "<" << m_loglevel << ">";
 			}
@@ -88,7 +88,7 @@ namespace crucible {
 				header_stream << "<" << m_loglevel << ">";
 			}
 			header_stream << (m_name.empty() ? "thread" : m_name);
-			header_stream << "[" << crucible::gettid() << "]";
+			header_stream << "[" << gettid() << "]";
 		}
 		header_stream << ": ";
@@ -159,12 +159,13 @@ namespace crucible {
 	{
 		THROW_CHECK1(invalid_argument, src_length, src_length > 0);
 		while (src_length > 0) {
-			off_t length = min(off_t(BTRFS_MAX_DEDUPE_LEN), src_length);
-			BtrfsExtentSame bes(src_fd, src_offset, length);
+			BtrfsExtentSame bes(src_fd, src_offset, src_length);
 			bes.add(dst_fd, dst_offset);
 			bes.do_ioctl();
-			auto status = bes.m_info.at(0).status;
+			const auto status = bes.m_info.at(0).status;
 			if (status == 0) {
 				const off_t length = bes.m_info.at(0).bytes_deduped;
 				THROW_CHECK0(invalid_argument, length > 0);
 				src_offset += length;
 				dst_offset += length;
 				src_length -= length;
@@ -333,7 +334,7 @@ namespace crucible {
 		btrfs_ioctl_logical_ino_args args = (btrfs_ioctl_logical_ino_args) {
 			.logical = m_logical,
 			.size = m_container_size,
-			.inodes = reinterpret_cast<uint64_t>(m_container.prepare(m_container_size)),
+			.inodes = reinterpret_cast<uintptr_t>(m_container.prepare(m_container_size)),
 		};
 		// We are still supporting building with old headers that don't have .flags yet
 		*(&args.reserved[0] + 3) = m_flags;
@@ -416,7 +417,7 @@ namespace crucible {
 	{
 		btrfs_ioctl_ino_path_args *p = static_cast<btrfs_ioctl_ino_path_args *>(this);
 		BtrfsDataContainer container(m_container_size);
-		fspath = reinterpret_cast<uint64_t>(container.prepare(m_container_size));
+		fspath = reinterpret_cast<uintptr_t>(container.prepare(m_container_size));
 		size = container.get_size();
 		m_paths.clear();
@@ -753,6 +754,11 @@ namespace crucible {
 		return offset + len;
 	}
 	thread_local size_t BtrfsIoctlSearchKey::s_calls = 0;
 	thread_local size_t BtrfsIoctlSearchKey::s_loops = 0;
 	thread_local size_t BtrfsIoctlSearchKey::s_loops_empty = 0;
 	thread_local shared_ptr<ostream> BtrfsIoctlSearchKey::s_debug_ostream;
 	bool
 	BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
 	{
@@ -771,8 +777,17 @@ namespace crucible {
 			ioctl_ptr = ioctl_arg.get<btrfs_ioctl_search_args_v2>();
 			ioctl_ptr->key = static_cast<const btrfs_ioctl_search_key&>(*this);
 			ioctl_ptr->buf_size = buf_size;
 			if (s_debug_ostream) {
 				(*s_debug_ostream) << "bisk " << (ioctl_ptr->key) << "\n";
 			}
 			// Don't bother supporting V1.  Kernels that old have other problems.
 			int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_arg.data());
 			++s_calls;
 			if (rv != 0 && errno == ENOENT) {
 				// If we are searching a tree that is deleted or no longer exists, just return an empty list
 				ioctl_ptr->key.nr_items = 0;
 				break;
 			}
 			if (rv != 0 && errno != EOVERFLOW) {
 				return false;
 			}
@@ -794,6 +809,10 @@ namespace crucible {
 				buf_size *= 2;
 			}
 			// don't automatically raise the buf size higher than 64K, the largest possible btrfs item
 			++s_loops;
 			if (ioctl_ptr->key.nr_items == 0) {
 				++s_loops_empty;
 			}
 		} while (buf_size < 65536);
 		// ioctl changes nr_items, this has to be copied back
@@ -866,6 +885,26 @@ namespace crucible {
 		}
 	}
 	string
 	btrfs_chunk_type_ntoa(uint64_t type)
 	{
 		static const bits_ntoa_table table[] = {
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DATA),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_METADATA),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_SYSTEM),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DUP),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID0),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID10),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C3),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C4),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID5),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID6),
 			NTOA_TABLE_ENTRY_END()
 		};
 		return bits_ntoa(type, table);
 	}
 	string
 	btrfs_search_type_ntoa(unsigned type)
 	{
@@ -893,15 +932,9 @@ namespace crucible {
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_BLOCK_REF_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_DATA_REF_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_BLOCK_GROUP_ITEM_KEY),
 #ifdef BTRFS_FREE_SPACE_INFO_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_INFO_KEY),
 #endif
 #ifdef BTRFS_FREE_SPACE_EXTENT_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_EXTENT_KEY),
 #endif
 #ifdef BTRFS_FREE_SPACE_BITMAP_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_BITMAP_KEY),
 #endif
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_EXTENT_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_ITEM_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_CHUNK_ITEM_KEY),
@@ -933,9 +966,7 @@ namespace crucible {
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_CSUM_TREE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_QUOTA_TREE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_UUID_TREE_OBJECTID),
 #ifdef BTRFS_FREE_SPACE_TREE_OBJECTID
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_TREE_OBJECTID),
 #endif
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_BALANCE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_ORPHAN_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_TREE_LOG_OBJECTID),
@@ -956,6 +987,28 @@ namespace crucible {
 		return bits_ntoa(objectid, table);
 	}
 	string
 	btrfs_inode_flags_ntoa(uint64_t const inode_flags)
 	{
 		static const bits_ntoa_table table[] = {
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_NODATASUM),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_NODATACOW),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_READONLY),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_NOCOMPRESS),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_PREALLOC),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_SYNC),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_IMMUTABLE),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_APPEND),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_NODUMP),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_NOATIME),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_DIRSYNC),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_COMPRESS),
 			NTOA_TABLE_ENTRY_BITS(BTRFS_INODE_ROOT_ITEM_INIT),
 			NTOA_TABLE_ENTRY_END()
 		};
 		return bits_ntoa(inode_flags, table);
 	}
 	ostream &
 	operator<<(ostream &os, const btrfs_ioctl_search_key &key)
 	{
@@ -1123,11 +1176,17 @@ namespace crucible {
 	{
 	}
-	void
+	bool
-	BtrfsIoctlFsInfoArgs::do_ioctl(int fd)
+	BtrfsIoctlFsInfoArgs::do_ioctl_nothrow(int const fd)
 	{
 		btrfs_ioctl_fs_info_args_v3 *p = static_cast<btrfs_ioctl_fs_info_args_v3 *>(this);
-		if (ioctl(fd, BTRFS_IOC_FS_INFO, p)) {
+		return 0 == ioctl(fd, BTRFS_IOC_FS_INFO, p);
 	}
 	void
 	BtrfsIoctlFsInfoArgs::do_ioctl(int const fd)
 	{
 		if (!do_ioctl_nothrow(fd)) {
 			THROW_ERRNO("BTRFS_IOC_FS_INFO: fd " << fd);
 		}
 	}
@@ -1144,6 +1203,13 @@ namespace crucible {
 		return this->btrfs_ioctl_fs_info_args_v3::csum_size;
 	}
 	vector<uint8_t>
 	BtrfsIoctlFsInfoArgs::fsid() const
 	{
 		const auto begin = btrfs_ioctl_fs_info_args_v3::fsid;
 		return vector<uint8_t>(begin, begin + BTRFS_FSID_SIZE);
 	}
 	uint64_t
 	BtrfsIoctlFsInfoArgs::generation() const
 	{
@@ -62,11 +62,22 @@ namespace crucible {
 		return rv;
 	}
 	static MultiLocker s_process_instance;
 	shared_ptr<MultiLocker::LockHandle>
 	MultiLocker::get_lock(const string &type)
 	{
-		static MultiLocker s_process_instance;
+		if (s_process_instance.m_do_locking) {
-		return s_process_instance.get_lock_private(type);
+			return s_process_instance.get_lock_private(type);
 		} else {
 			return shared_ptr<MultiLocker::LockHandle>();
 		}
 	}
 	void
 	MultiLocker::enable_locking(const bool enabled)
 	{
 		s_process_instance.m_do_locking = enabled;
 	}
 }
@@ -0,0 +1,31 @@
 #include "crucible/openat2.h"
 #include <sys/syscall.h>
 // Compatibility for building on old libc for new kernel
 // Every arch that defines this (so far) uses 437, except Alpha, where 437 is
 // mq_getsetattr.
 #ifndef SYS_openat2
 #ifdef __alpha__
 #define SYS_openat2 547
 #else
 #define SYS_openat2 437
 #endif
 #endif
 #include <fcntl.h>
 #include <unistd.h>
 extern "C" {
 int
 __attribute__((weak))
 openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
 throw()
 {
 	return syscall(SYS_openat2, dirfd, pathname, how, size);
 }
 };
@@ -7,13 +7,18 @@
 #include <cstdlib>
 #include <utility>
 // for gettid()
 #ifndef _GNU_SOURCE
 #define _GNU_SOURCE
 #endif
 #include <unistd.h>
 #include <sys/syscall.h>
 extern "C" {
 	pid_t
 	__attribute__((weak))
 	gettid() throw()
 	{
 		return syscall(SYS_gettid);
 	}
 };
 namespace crucible {
 	using namespace std;
@@ -111,12 +116,6 @@ namespace crucible {
 		}
 	}
 	pid_t
 	gettid()
 	{
 		return syscall(SYS_gettid);
 	}
 	double
 	getloadavg1()
 	{
@@ -0,0 +1,7 @@
 #include "crucible/seeker.h"
 namespace crucible {
 	thread_local shared_ptr<ostream> tl_seeker_debug_str;
 };
@@ -0,0 +1,254 @@
 #include "crucible/table.h"
 #include "crucible/string.h"
 namespace crucible {
 	namespace Table {
 		using namespace std;
 		Content
 		Fill(const char c)
 		{
 			return [=](size_t width, size_t height) -> string {
 				string rv;
 				while (height--) {
 					rv += string(width, c);
 					if (height) {
 						rv += "\n";
 					}
 				}
 				return rv;
 			};
 		}
 		Content
 		Text(const string &s)
 		{
 			return [=](size_t width, size_t height) -> string {
 				const auto lines = split("\n", s);
 				string rv;
 				size_t line_count = 0;
 				for (const auto &i : lines) {
 					if (line_count++) {
 						rv += "\n";
 					}
 					rv += i;
 					if (i.length() < width) {
 						rv += string(width - i.length(), ' ');
 					}
 				}
 				while (line_count < height) {
 					if (line_count++) {
 						rv += "\n";
 					}
 					rv += string(width, ' ');
 				}
 				return rv;
 			};
 		}
 		Content
 		Number(const string &s)
 		{
 			return [=](size_t width, size_t height) -> string {
 				const auto lines = split("\n", s);
 				string rv;
 				size_t line_count = 0;
 				for (const auto &i : lines) {
 					if (line_count++) {
 						rv += "\n";
 					}
 					if (i.length() < width) {
 						rv += string(width - i.length(), ' ');
 					}
 					rv += i;
 				}
 				while (line_count < height) {
 					if (line_count++) {
 						rv += "\n";
 					}
 					rv += string(width, ' ');
 				}
 				return rv;
 			};
 		}
 		Cell::Cell(const Content &fn) :
 			m_content(fn)
 		{
 		}
 		Cell&
 		Cell::operator=(const Content &fn)
 		{
 			m_content = fn;
 			return *this;
 		}
 		string
 		Cell::text(size_t width, size_t height) const
 		{
 			return m_content(width, height);
 		}
 		size_t
 		Dimension::size() const
 		{
 			return m_elements.size();
 		}
 		size_t
 		Dimension::insert(size_t pos)
 		{
 			++m_next_pos;
 			const auto insert_pos = min(m_elements.size(), pos);
 			const auto it = m_elements.begin() + insert_pos;
 			m_elements.insert(it, m_next_pos);
 			return insert_pos;
 		}
 		void
 		Dimension::erase(size_t pos)
 		{
 			const auto it = m_elements.begin() + min(m_elements.size(), pos);
 			m_elements.erase(it);
 		}
 		size_t
 		Dimension::at(size_t pos) const
 		{
 			return m_elements.at(pos);
 		}
 		Dimension&
 		Table::rows()
 		{
 			return m_rows;
 		};
 		const Dimension&
 		Table::rows() const
 		{
 			return m_rows;
 		};
 		Dimension&
 		Table::cols()
 		{
 			return m_cols;
 		};
 		const Dimension&
 		Table::cols() const
 		{
 			return m_cols;
 		};
 		const Cell&
 		Table::at(size_t row, size_t col) const
 		{
 			const auto row_idx = m_rows.at(row);
 			const auto col_idx = m_cols.at(col);
 			const auto found = m_cells.find(make_pair(row_idx, col_idx));
 			if (found == m_cells.end()) {
 				static const Cell s_empty(Fill('.'));
 				return s_empty;
 			}
 			return found->second;
 		};
 		Cell&
 		Table::at(size_t row, size_t col)
 		{
 			const auto row_idx = m_rows.at(row);
 			const auto col_idx = m_cols.at(col);
 			return m_cells[make_pair(row_idx, col_idx)];
 		};
 		static
 		pair<size_t, size_t>
 		text_size(const string &s)
 		{
 			const auto s_split = split("\n", s);
 			size_t width = 0;
 			for (const auto &i : s_split) {
 				width = max(width, i.length());
 			}
 			return make_pair(width, s_split.size());
 		}
 		ostream& operator<<(ostream &os, const Table &table)
 		{
 			const auto rows = table.rows().size();
 			const auto cols = table.cols().size();
 			vector<size_t> row_heights(rows, 1);
 			vector<size_t> col_widths(cols, 1);
 			// Get the size of all fixed- and minimum-sized content cells
 			for (size_t row = 0; row < table.rows().size(); ++row) {
 				vector<string> col_text;
 				for (size_t col = 0; col < table.cols().size(); ++col) {
 					col_text.push_back(table.at(row, col).text(0, 0));
 					const auto tsize = text_size(*col_text.rbegin());
 					row_heights[row] = max(row_heights[row], tsize.second);
 					col_widths[col] = max(col_widths[col], tsize.first);
 				}
 			}
 			// Render the table
 			for (size_t row = 0; row < table.rows().size(); ++row) {
 				vector<string> lines(row_heights[row], "");
 				for (size_t col = 0; col < table.cols().size(); ++col) {
 					const auto& table_cell = table.at(row, col);
 					const auto table_text = table_cell.text(col_widths[col], row_heights[row]);
 					auto col_lines = split("\n", table_text);
 					col_lines.resize(row_heights[row], "");
 					for (size_t line = 0; line < row_heights[row]; ++line) {
 						if (col > 0) {
 							lines[line] += table.mid();
 						}
 						lines[line] += col_lines[line];
 					}
 				}
 				for (const auto &line : lines) {
 					os << table.left() << line << table.right() << "\n";
 				}
 			}
 			return os;
 		}
 		void
 		Table::left(const string &s)
 		{
 			m_left = s;
 		}
 		void
 		Table::mid(const string &s)
 		{
 			m_mid = s;
 		}
 		void
 		Table::right(const string &s)
 		{
 			m_right = s;
 		}
 		const string&
 		Table::left() const
 		{
 			return m_left;
 		}
 		const string&
 		Table::mid() const
 		{
 			return m_mid;
 		}
 		const string&
 		Table::right() const
 		{
 			return m_right;
 		}
 	}
 }
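Review note: the renderer sizes each row and column from `text_size`, which returns the longest line length and the line count of a cell's text. Below is a minimal standalone sketch of that measurement. It uses `std::getline` rather than crucible's `split`, so edge cases such as empty input or a trailing newline may behave differently from the real helper; `text_size_sketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for text_size() in the diff above.
// Assumption: lines are separated by '\n'.
// Returns (width of the longest line, number of lines).
std::pair<std::size_t, std::size_t>
text_size_sketch(const std::string &s)
{
	std::istringstream in(s);
	std::string line;
	std::size_t width = 0;
	std::size_t height = 0;
	while (std::getline(in, line)) {
		width = std::max(width, line.length());
		++height;
	}
	return std::make_pair(width, height);
}
```

For the input `"ab\ncdef"` this yields a width of 4 and a height of 2, which is the minimum box a cell holding that text would need.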
@@ -76,13 +76,24 @@ namespace crucible {
 		/// Tasks to be executed after the current task is executed
 		list<TaskStatePtr>			m_post_exec_queue;
-		/// Set by run() and append().  Cleared by exec().
+		/// Set by run(), append(), and insert().  Cleared by exec().
 		bool					m_run_now = false;
 		/// Set by insert().  Cleared by exec() and destructor.
 		bool					m_sort_queue = false;
 		/// Set when task starts execution by exec().
 		/// Cleared when exec() ends.
 		bool					m_is_running = false;
 		/// Set when task is queued while already running.
 		/// Cleared when task is requeued.
 		bool					m_run_again = false;
 		/// Set when task is queued as idle task while already running.
 		/// Cleared when task is queued as non-idle task.
 		bool					m_idle = false;
 		/// Sequential identifier for next task
 		static atomic<TaskId>			s_next_id;
@@ -107,7 +118,7 @@ namespace crucible {
 		static void clear_queue(TaskQueue &tq);
 		/// Rescue any TaskQueue, not just this one.
-		static void rescue_queue(TaskQueue &tq);
+		static void rescue_queue(TaskQueue &tq, const bool sort_queue);
 		TaskState &operator=(const TaskState &) = delete;
 		TaskState(const TaskState &) = delete;
@@ -124,6 +135,9 @@ namespace crucible {
 		/// instance at the end of TaskMaster's global queue.
 		void run();
 		/// Run the task when there are no more Tasks on the main queue.
 		void idle();
 		/// Execute task immediately in current thread if it is not already
 		/// executing in another thread; otherwise, append the current task
 		/// to itself to be executed immediately in the other thread.
@@ -139,6 +153,10 @@ namespace crucible {
 		/// or is destroyed.
 		void append(const TaskStatePtr &task);
 		/// Queue task to execute after current task finishes executing
 		/// or is destroyed, in task ID order.
 		void insert(const TaskStatePtr &task);
 		/// How many Tasks are there?  Good for catching leaks
 		static size_t instance_count();
 	};
@@ -150,6 +168,7 @@ namespace crucible {
 		mutex 					m_mutex;
 		condition_variable 			m_condvar;
 		TaskQueue				m_queue;
 		TaskQueue				m_idle_queue;
 		size_t					m_thread_max;
 		size_t					m_thread_min = 0;
 		set<TaskConsumerPtr>			m_threads;
@@ -184,6 +203,7 @@ namespace crucible {
 		TaskMasterState(size_t thread_max = thread::hardware_concurrency());
 		static void push_back(const TaskStatePtr &task);
 		static void push_back_idle(const TaskStatePtr &task);
 		static void push_front(TaskQueue &queue);
 		size_t get_queue_count();
 		size_t get_thread_count();
@@ -214,16 +234,21 @@ namespace crucible {
 	static auto s_tms = make_shared<TaskMasterState>();
 	void
-	TaskState::rescue_queue(TaskQueue &queue)
+	TaskState::rescue_queue(TaskQueue &queue, const bool sort_queue)
 	{
 		if (queue.empty()) {
 			return;
 		}
-		const auto tlcc = tl_current_consumer;
+		const auto &tlcc = tl_current_consumer;
 		if (tlcc) {
 			// We are executing under a TaskConsumer, splice our post-exec queue at front.
 			// No locks needed because we are using only thread-local objects.
 			tlcc->m_local_queue.splice(tlcc->m_local_queue.begin(), queue);
 			if (sort_queue) {
 				tlcc->m_local_queue.sort([&](const TaskStatePtr &a, const TaskStatePtr &b) {
 					return a->m_id < b->m_id;
 				});
 			}
 		} else {
 			// We are not executing under a TaskConsumer.
 			// If there is only one task, then just insert it at the front of the queue.
@@ -234,6 +259,8 @@ namespace crucible {
 				// then push it to the front of the global queue using normal locking methods.
 				TaskStatePtr rescue_task(make_shared<TaskState>("rescue_task", [](){}));
 				swap(rescue_task->m_post_exec_queue, queue);
 				// Do the sort--once--when a new Consumer has picked up the Task
 				rescue_task->m_sort_queue = sort_queue;
 				TaskQueue tq_one { rescue_task };
 				TaskMasterState::push_front(tq_one);
 			}
@@ -246,7 +273,8 @@ namespace crucible {
 		--s_instance_count;
 		unique_lock<mutex> lock(m_mutex);
 		// If any dependent Tasks were appended since the last exec, run them now
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
 		// No need to clear m_sort_queue here, it won't exist soon
 	}
 	TaskState::TaskState(string title, function<void()> exec_fn) :
@@ -305,6 +333,24 @@ namespace crucible {
 			task->m_run_now = true;
 			append_nolock(task);
 		}
 		task->m_idle = false;
 	}
 	void
 	TaskState::insert(const TaskStatePtr &task)
 	{
 		THROW_CHECK0(invalid_argument, task);
 		THROW_CHECK2(invalid_argument, m_id, task->m_id, m_id != task->m_id);
 		PairLock lock(m_mutex, task->m_mutex);
 		if (!task->m_run_now) {
 			task->m_run_now = true;
 			// Move the task and its post-exec queue to follow this task,
 			// and request a sort of the flattened list.
 			m_sort_queue = true;
 			m_post_exec_queue.push_back(task);
 			m_post_exec_queue.splice(m_post_exec_queue.end(), task->m_post_exec_queue);
 		}
 		task->m_idle = false;
 	}
 	void
@@ -315,7 +361,7 @@ namespace crucible {
 		unique_lock<mutex> lock(m_mutex);
 		if (m_is_running) {
-			append_nolock(shared_from_this());
+			m_run_again = true;
 			return;
 		} else {
 			m_run_now = false;
@@ -339,8 +385,20 @@ namespace crucible {
 		swap(this_task, tl_current_task);
 		m_is_running = false;
 		if (m_run_again) {
 			m_run_again = false;
 			if (m_idle) {
 				// All the way back to the end of the line
 				TaskMasterState::push_back_idle(shared_from_this());
 			} else {
 				// Insert after any dependents waiting for this Task
 				m_post_exec_queue.push_back(shared_from_this());
 			}
 		}
 		// Splice task post_exec queue at front of local queue
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
 		m_sort_queue = false;
 	}
 	string
@@ -360,11 +418,32 @@ namespace crucible {
 	TaskState::run()
 	{
 		unique_lock<mutex> lock(m_mutex);
 		m_idle = false;
 		if (m_run_now) {
 			return;
 		}
 		m_run_now = true;
-		TaskMasterState::push_back(shared_from_this());
+		if (m_is_running) {
 			m_run_again = true;
 		} else {
 			TaskMasterState::push_back(shared_from_this());
 		}
 	}
 	void
 	TaskState::idle()
 	{
 		unique_lock<mutex> lock(m_mutex);
 		m_idle = true;
 		if (m_run_now) {
 			return;
 		}
 		m_run_now = true;
 		if (m_is_running) {
 			m_run_again = true;
 		} else {
 			TaskMasterState::push_back_idle(shared_from_this());
 		}
 	}
 	TaskMasterState::TaskMasterState(size_t thread_max) :
@@ -410,6 +489,20 @@ namespace crucible {
 		s_tms->start_threads_nolock();
 	}
 	void
 	TaskMasterState::push_back_idle(const TaskStatePtr &task)
 	{
 		THROW_CHECK0(runtime_error, task);
 		unique_lock<mutex> lock(s_tms->m_mutex);
 		if (s_tms->m_cancelled) {
 			task->clear();
 			return;
 		}
 		s_tms->m_idle_queue.push_back(task);
 		s_tms->m_condvar.notify_all();
 		s_tms->start_threads_nolock();
 	}
 	void
 	TaskMasterState::push_front(TaskQueue &queue)
 	{
@@ -456,12 +549,26 @@ namespace crucible {
 	TaskMaster::print_queue(ostream &os)
 	{
 		unique_lock<mutex> lock(s_tms->m_mutex);
-		os << "Queue (size " << s_tms->m_queue.size() << "):" << endl;
+		auto queue_copy = s_tms->m_queue;
 		lock.unlock();
 		os << "Queue (size " << queue_copy.size() << "):" << endl;
 		size_t counter = 0;
-		for (auto i : s_tms->m_queue) {
+		for (auto i : queue_copy) {
 			os << "Queue #" << ++counter << " Task ID " << i->id() << " " << i->title() << endl;
 		}
-		return os << "Queue End" << endl;
+		os << "Queue End" << endl;
 		lock.lock();
 		queue_copy = s_tms->m_idle_queue;
 		lock.unlock();
 		os << "Idle (size " << queue_copy.size() << "):" << endl;
 		counter = 0;
 		for (const auto &i : queue_copy) {
 			os << "Idle #" << ++counter << " Task ID " << i->id() << " " << i->title() << endl;
 		}
 		os << "Idle End" << endl;
 		return os;
 	}
 	ostream &
@@ -486,11 +593,6 @@ namespace crucible {
 	size_t
 	TaskMasterState::calculate_thread_count_nolock()
 	{
 		if (m_paused) {
 			// No threads running while paused or cancelled
 			return 0;
 		}
 		if (m_load_target == 0) {
 			// No limits, no stats, use configured thread count
 			return m_configured_thread_max;
@@ -583,6 +685,7 @@ namespace crucible {
 		m_cancelled = true;
 		decltype(m_queue) empty_queue;
 		m_queue.swap(empty_queue);
 		empty_queue.splice(empty_queue.end(), m_idle_queue);
 		m_condvar.notify_all();
 		lock.unlock();
 		TaskState::clear_queue(empty_queue);
@@ -600,6 +703,9 @@ namespace crucible {
 		unique_lock<mutex> lock(m_mutex);
 		m_paused = paused;
 		m_condvar.notify_all();
 		if (!m_paused) {
 			start_threads_nolock();
 		}
 		lock.unlock();
 	}
@@ -648,7 +754,7 @@ namespace crucible {
 		m_prev_loadavg = getloadavg1();
 		if (target && !m_load_tracking_thread) {
-			m_load_tracking_thread = make_shared<thread>([=] () { loadavg_thread_fn(); });
+			m_load_tracking_thread = make_shared<thread>([this] () { loadavg_thread_fn(); });
 			m_load_tracking_thread->detach();
 		}
 	}
@@ -682,6 +788,13 @@ namespace crucible {
 		m_task_state->run();
 	}
 	void
 	Task::idle() const
 	{
 		THROW_CHECK0(runtime_error, m_task_state);
 		m_task_state->idle();
 	}
 	void
 	Task::append(const Task &that) const
 	{
@@ -690,6 +803,14 @@ namespace crucible {
 		m_task_state->append(that.m_task_state);
 	}
 	void
 	Task::insert(const Task &that) const
 	{
 		THROW_CHECK0(runtime_error, m_task_state);
 		THROW_CHECK0(runtime_error, that);
 		m_task_state->insert(that.m_task_state);
 	}
 	Task
 	Task::current_task()
 	{
@@ -772,6 +893,9 @@ namespace crucible {
 			} else if (!master_copy->m_queue.empty()) {
 				m_current_task = *master_copy->m_queue.begin();
 				master_copy->m_queue.pop_front();
 			} else if (!master_copy->m_idle_queue.empty()) {
 				m_current_task = *master_copy->m_idle_queue.begin();
 				master_copy->m_idle_queue.pop_front();
 			} else {
 				master_copy->m_condvar.wait(lock);
 				continue;
@@ -801,11 +925,13 @@ namespace crucible {
 		swap(this_consumer, tl_current_consumer);
 		assert(!tl_current_consumer);
-		// Release lock to rescue queue (may attempt to queue a new task at TaskMaster).
-		// rescue_queue normally sends tasks to the local queue of the current TaskConsumer thread,
-		// but we just disconnected ourselves from that.
+		// Release lock to rescue queue (may attempt to queue a
+		// new task at TaskMaster).  rescue_queue normally sends
+		// tasks to the local queue of the current TaskConsumer
+		// thread, but we just disconnected ourselves from that.
+		// No sorting here because this is not a TaskState.
 		lock.unlock();
-		TaskState::rescue_queue(m_local_queue);
+		TaskState::rescue_queue(m_local_queue, false);
 		// Hold lock so we can erase ourselves
 		lock.lock();
@@ -818,7 +944,7 @@ namespace crucible {
 	TaskConsumer::TaskConsumer(const shared_ptr<TaskMasterState> &tms) :
 		m_master(tms)
 	{
-		m_thread = make_shared<thread>([=](){ consumer_thread(); });
+		m_thread = make_shared<thread>([this](){ consumer_thread(); });
 	}
 	class BarrierState {
@@ -883,21 +1009,6 @@ namespace crucible {
 		m_owner.reset();
 	}
 	void
 	Exclusion::insert_task(const Task &task)
 	{
 		unique_lock<mutex> lock(m_mutex);
 		const auto sp = m_owner.lock();
 		lock.unlock();
 		if (sp) {
 			// If Exclusion is locked then queue task for release;
 			sp->append(task);
 		} else {
 			// otherwise, run the inserted task immediately
 			task.run();
 		}
 	}
 	ExclusionLock
 	Exclusion::try_lock(const Task &task)
 	{
@@ -905,7 +1016,7 @@ namespace crucible {
 		const auto sp = m_owner.lock();
 		if (sp) {
 			if (task) {
-				sp->append(task);
+				sp->insert(task);
 			}
 			return ExclusionLock();
 		} else {
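Review note: with this change `Exclusion::try_lock` queues the waiting task through `insert()`, which requests a sort, and `rescue_queue` then splices the post-exec queue onto the front of the consumer's local queue and orders that queue by task ID. Below is a minimal standalone sketch of the splice-then-sort step. `FakeTask`, `FakeTaskPtr`, and `splice_front` are hypothetical names used only for illustration, and the sketch omits all locking and the rescue-task path.

```cpp
#include <list>
#include <memory>

// Hypothetical stand-ins for TaskState and TaskStatePtr
struct FakeTask {
	unsigned id;
};
using FakeTaskPtr = std::shared_ptr<FakeTask>;

// Move every task from incoming to the front of local.  If sort_queue
// is set, order the whole local queue by ascending task ID, as the
// TaskConsumer branch of rescue_queue does in the diff above.
void
splice_front(std::list<FakeTaskPtr> &local, std::list<FakeTaskPtr> &incoming, const bool sort_queue)
{
	local.splice(local.begin(), incoming);
	if (sort_queue) {
		local.sort([](const FakeTaskPtr &a, const FakeTaskPtr &b) {
			return a->id < b->id;
		});
	}
}
```

Note that the sort covers the entire local queue, not only the tasks that were just spliced in, so older tasks with lower IDs move ahead of newer ones.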
@@ -98,12 +98,16 @@ namespace crucible {
 		m_rate(rate),
 		m_burst(burst)
 	{
 		THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
 		THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
 	}
 	RateLimiter::RateLimiter(double rate) :
 		m_rate(rate),
 		m_burst(rate)
 	{
 		THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
 		THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
 	}
 	void
@@ -119,6 +123,7 @@ namespace crucible {
 	double
 	RateLimiter::sleep_time(double cost)
 	{
 		THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
 		borrow(cost);
 		unique_lock<mutex> lock(m_mutex);
 		update_tokens();
@@ -154,6 +159,21 @@ namespace crucible {
 		m_tokens -= cost;
 	}
 	void
 	RateLimiter::rate(double const new_rate)
 	{
 		THROW_CHECK1(invalid_argument, new_rate, new_rate > 0);
 		unique_lock<mutex> lock(m_mutex);
 		m_rate = new_rate;
 	}
 	double
 	RateLimiter::rate() const
 	{
 		unique_lock<mutex> lock(m_mutex);
 		return m_rate;
 	}
 	RateEstimator::RateEstimator(double min_delay, double max_delay) :
 		m_min_delay(min_delay),
 		m_max_delay(max_delay)
@@ -202,6 +222,13 @@ namespace crucible {
 		}
 	}
 	void
 	RateEstimator::increment(const uint64_t more)
 	{
 		unique_lock<mutex> lock(m_mutex);
 		return update_unlocked(m_last_count + more);
 	}
 	uint64_t
 	RateEstimator::count() const
 	{
@@ -1,5 +1,13 @@
 #!/bin/bash
 # if not called from systemd try to replicate mount unsharing on ctrl+c
 # see: https://github.com/Zygo/bees/issues/281
 if [ -z "${SYSTEMD_EXEC_PID}" -a -z "${UNSHARE_DONE}" ]; then
        UNSHARE_DONE=true
        export UNSHARE_DONE
        exec unshare -m --propagation private -- "$0" "$@"
 fi
 ## Helpful functions
 INFO(){ echo "INFO:" "$@"; }
 ERRO(){ echo "ERROR:" "$@"; exit 1; }
@@ -108,13 +116,11 @@ mkdir -p "$WORK_DIR" || exit 1
 INFO "MOUNT DIR: $MNT_DIR"
 mkdir -p "$MNT_DIR" || exit 1
-mount --make-private -osubvolid=5 /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
+mount --make-private -osubvolid=5,nodev,noexec /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
 if [ ! -d "$BEESHOME" ]; then
    INFO "Create subvol $BEESHOME for store bees data"
    btrfs sub cre "$BEESHOME"
 else
    btrfs sub show "$BEESHOME" &> /dev/null || ERRO "$BEESHOME MUST BE A SUBVOL!"
 fi
 # Check DB size
@@ -5,7 +5,7 @@ After=sysinit.target
 [Service]
 Type=simple
-ExecStart=@PREFIX@/sbin/beesd --no-timestamps %i
+ExecStart=@PREFIX@/@BINDIR@/beesd --no-timestamps %i
 CPUAccounting=true
 CPUSchedulingPolicy=batch
 CPUWeight=12
@@ -17,6 +17,7 @@ KillSignal=SIGTERM
 MemoryAccounting=true
 Nice=19
 Restart=on-abnormal
 RuntimeDirectoryMode=0700
 RuntimeDirectory=bees
 StartupCPUWeight=25
 StartupIOWeight=25
@@ -20,7 +20,6 @@
 using namespace crucible;
 using namespace std;
 BeesFdCache::BeesFdCache(shared_ptr<BeesContext> ctx) :
 	m_ctx(ctx)
 {
@@ -98,6 +97,9 @@ BeesContext::dump_status()
 		TaskMaster::print_queue(ofs);
 #endif
 		ofs << "PROGRESS:\n";
 		ofs << get_progress();
 		ofs.close();
 		BEESNOTE("renaming status file '" << status_file << "'");
@@ -112,6 +114,23 @@ BeesContext::dump_status()
 	}
 }
 void
 BeesContext::set_progress(const string &str)
 {
 	unique_lock<mutex> lock(m_progress_mtx);
 	m_progress_str = str;
 }
 string
 BeesContext::get_progress()
 {
 	unique_lock<mutex> lock(m_progress_mtx);
 	if (m_progress_str.empty()) {
 		return "[No progress estimate available]\n";
 	}
 	return m_progress_str;
 }
 void
 BeesContext::show_progress()
 {
@@ -159,6 +178,8 @@ BeesContext::show_progress()
 			BEESLOGINFO("\ttid " << t.first << ": " << t.second);
 		}
 		// No need to log progress here, it is logged when set
 		lastStats = thisStats;
 	}
 }
@@ -182,7 +203,7 @@ BeesContext::home_fd()
 }
 bool
-BeesContext::is_root_ro(uint64_t root)
+BeesContext::is_root_ro(uint64_t const root)
 {
 	return roots()->is_root_ro(root);
 }
@@ -192,6 +213,7 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 {
 	// TOOLONG and NOTE can retroactively fill in the filename details, but LOG can't
 	BEESNOTE("dedup " << brp_in);
 	BEESTRACE("dedup " << brp_in);
 	if (is_root_ro(brp_in.second.fid().root())) {
 		// BEESLOGDEBUG("WORKAROUND: dst root " << (brp_in.second.fid().root()) << " is read-only);
@@ -208,8 +230,10 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 	BeesAddress first_addr(brp.first.fd(), brp.first.begin());
 	BeesAddress second_addr(brp.second.fd(), brp.second.begin());
-	if (first_addr.get_physical_or_zero() == second_addr.get_physical_or_zero()) {
-		BEESLOGTRACE("equal physical addresses in dedup");
+	const auto first_gpoz = first_addr.get_physical_or_zero();
+	const auto second_gpoz = second_addr.get_physical_or_zero();
+	if (first_gpoz == second_gpoz) {
+		BEESLOGDEBUG("equal physical addresses " << first_addr << " and " << second_addr << " in dedup");
 		BEESCOUNT(bug_dedup_same_physical);
 	}
@@ -219,27 +243,40 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 	BEESCOUNT(dedup_try);
 	BEESNOTE("waiting to dedup " << brp);
-	const auto lock = MultiLocker::get_lock("dedupe");
+	auto lock = MultiLocker::get_lock("dedupe");
 	Timer dedup_timer;
 	BEESLOGINFO("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
 		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
 	BEESNOTE("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
 		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
-	const bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
+	while (true) {
-	BEESCOUNTADD(dedup_ms, dedup_timer.age() * 1000);
+		try {
 			Timer dedup_timer;
 			const bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
 			BEESCOUNTADD(dedup_ms, dedup_timer.age() * 1000);
-	if (rv) {
+			if (rv) {
-		BEESCOUNT(dedup_hit);
+				BEESCOUNT(dedup_hit);
-		BEESCOUNTADD(dedup_bytes, brp.first.size());
+				BEESCOUNTADD(dedup_bytes, brp.first.size());
-	} else {
+			} else {
-		BEESCOUNT(dedup_miss);
+				BEESCOUNT(dedup_miss);
-		BEESLOGWARN("NO Dedup! " << brp);
+				BEESLOGINFO("NO Dedup! " << brp);
 			}
 			lock.reset();
 			bees_throttle(dedup_timer.age(), "dedup");
 			return rv;
 		} catch (const std::system_error &e) {
 			if (e.code().value() == EAGAIN) {
 				BEESNOTE("dedup waiting for btrfs send on " << brp.second);
 				BEESLOGDEBUG("dedup waiting for btrfs send on " << brp.second);
 				roots()->wait_for_transid(1);
 			} else {
 				throw;
 			}
 		}
 	}
-	return rv;
 }
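
The retry loop added to `BeesContext::dedup` above can be read in isolation as "repeat the operation until it stops failing with `EAGAIN`, and rethrow anything else". The following is a minimal sketch of that pattern only, not code from bees: `retry_on_eagain` and `FlakyOp` are hypothetical names introduced for illustration, and the `wait` callback stands in for the `roots()->wait_for_transid(1)` call in the diff.

```cpp
#include <cassert>
#include <cerrno>
#include <system_error>

// Hypothetical helper (not part of bees): call `op` until it completes
// without throwing a std::system_error whose code value is EAGAIN.
// `wait` runs between attempts; in the diff that role is played by
// roots()->wait_for_transid(1).
template <class Op, class Wait>
auto retry_on_eagain(Op op, Wait wait) -> decltype(op())
{
	while (true) {
		try {
			return op();
		} catch (const std::system_error &e) {
			if (e.code().value() != EAGAIN) {
				throw;	// any other error propagates to the caller
			}
			wait();
		}
	}
}

// Hypothetical demo operation: fails with EAGAIN `failures` times,
// then succeeds and returns true.
struct FlakyOp {
	int failures;
	bool operator()()
	{
		if (failures > 0) {
			--failures;
			throw std::system_error(EAGAIN, std::generic_category(), "busy");
		}
		return true;
	}
};
```

Calling `retry_on_eagain(FlakyOp{2}, wait)` invokes `wait` twice before returning `true`. As in the diff, the sketch has no retry limit, so it relies on the `EAGAIN` condition eventually clearing.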
 BeesRangePair
@@ -264,6 +301,7 @@ BeesContext::rewrite_file_range(const BeesFileRange &bfr)
 	// BEESLOG("BeesResolver br(..., " << bfr << ")");
 	BEESTRACE("BeesContext::rewrite_file_range calling BeesResolver " << bfr);
 	BeesResolver br(m_ctx, BeesAddress(bfr.fd(), bfr.begin()));
 	BEESTRACE("BeesContext::rewrite_file_range calling replace_src " << dup_bbd);
 	// BEESLOG("\treplace_src " << dup_bbd);
 	br.replace_src(dup_bbd);
 	BEESCOUNT(scan_rewrite);
@@ -291,23 +329,38 @@ BeesContext::rewrite_file_range(const BeesFileRange &bfr)
 	}
 }
-BeesFileRange
+struct BeesSeenRange {
 	uint64_t bytenr;
 	off_t offset;
 	off_t length;
 };
 static
 bool
 operator<(const BeesSeenRange &bsr1, const BeesSeenRange &bsr2)
 {
 	return tie(bsr1.bytenr, bsr1.offset, bsr1.length) < tie(bsr2.bytenr, bsr2.offset, bsr2.length);
 }
 static
 __attribute__((unused))
 ostream&
 operator<<(ostream &os, const BeesSeenRange &tup)
 {
 	return os << "BeesSeenRange { " << to_hex(tup.bytenr) << ", " << to_hex(tup.offset) << "+" << pretty(tup.length) << " }";
 }
 void
 BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 {
 	BEESNOTE("Scanning " << pretty(e.size()) << " "
 		<< to_hex(e.begin()) << ".." << to_hex(e.end())
 		<< " " << name_fd(bfr.fd()) );
 	BEESTRACE("scan extent " << e);
 	BEESTRACE("scan bfr " << bfr);
 	BEESCOUNT(scan_extent);
-	// EXPERIMENT:  Don't bother with tiny extents unless they are the entire file.
+	Timer one_timer;
 	// We'll take a tiny extent at BOF or EOF but not in between.
 	if (e.begin() && e.size() < 128 * 1024 && e.end() != Stat(bfr.fd()).st_size) {
 		BEESCOUNT(scan_extent_tiny);
 		// This doesn't work properly with the current architecture,
 		// so we don't do an early return here.
 		// return bfr;
 	}
 	// We keep moving this method around
 	auto m_ctx = shared_from_this();
@@ -322,19 +375,19 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 		Extent::OBSCURED | Extent::PREALLOC
 	)) {
 		BEESCOUNT(scan_interesting);
-		BEESLOGWARN("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
+		BEESLOGINFO("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
 	}
 	if (e.flags() & Extent::HOLE) {
 		// Nothing here, dispose of this early
 		BEESCOUNT(scan_hole);
-		return bfr;
+		return;
 	}
 	if (e.flags() & Extent::PREALLOC) {
 		// Prealloc is all zero and we replace it with a hole.
 		// No special handling is required here.  Nuke it and move on.
-		BEESLOGINFO("prealloc extent " << e);
+		BEESLOGINFO("prealloc extent " << e << " in " << bfr);
 		// Must not extend past EOF
 		auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
 		// Must hold tmpfile until dedupe is done
@@ -347,38 +400,57 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 		if (m_ctx->dedup(brp)) {
 			BEESCOUNT(dedup_prealloc_hit);
 			BEESCOUNTADD(dedup_prealloc_bytes, e.size());
-			return bfr;
+			return;
 		} else {
 			BEESCOUNT(dedup_prealloc_miss);
 		}
 	}
 	// If we already read this extent and inserted it into the hash table, no need to read it again
 	static mutex s_seen_mutex;
 	unique_lock<mutex> lock_seen(s_seen_mutex);
 	const BeesSeenRange tup = {
 		.bytenr = e.bytenr(),
 		.offset = e.offset(),
 		.length = e.size(),
 	};
 	static set<BeesSeenRange> s_seen;
 	if (s_seen.size() > BEES_MAX_EXTENT_REF_COUNT) {
 		s_seen.clear();
 		BEESCOUNT(scan_seen_clear);
 	}
 	const auto seen_rv = s_seen.find(tup) != s_seen.end();
 	if (!seen_rv) {
 		BEESCOUNT(scan_seen_miss);
 	} else {
 		// BEESLOGDEBUG("Skip " << tup << " " << e);
 		BEESCOUNT(scan_seen_hit);
 		return;
 	}
 	lock_seen.unlock();
 	// OK we need to read extent now
 	bees_readahead(bfr.fd(), bfr.begin(), bfr.size());
 	map<off_t, pair<BeesHash, BeesAddress>> insert_map;
-	set<off_t> noinsert_set;
+	set<off_t> dedupe_set;
-
+	set<off_t> zero_set;
 	// Hole handling
 	bool extent_compressed = e.flags() & FIEMAP_EXTENT_ENCODED;
 	bool extent_contains_zero = false;
 	bool extent_contains_nonzero = false;
 	// Need to replace extent
 	bool rewrite_extent = false;
 	// Pretty graphs
 	off_t block_count = ((e.size() + BLOCK_MASK_SUMS) & ~BLOCK_MASK_SUMS) / BLOCK_SIZE_SUMS;
 	BEESTRACE(e << " block_count " << block_count);
 	string bar(block_count, '#');
-	for (off_t next_p = e.begin(); next_p < e.end(); ) {
+	// List of dedupes found
 	list<BeesRangePair> dedupe_list;
 	list<BeesFileRange> copy_list;
 	list<pair<BeesHash, BeesAddress>> front_hash_list;
 	list<uint64_t> invalidate_addr_list;
-		// Guarantee forward progress
+	off_t next_p = e.begin();
-		off_t p = next_p;
+	for (off_t p = e.begin(); p < e.end(); p += BLOCK_SIZE_SUMS) {
-		next_p += BLOCK_SIZE_SUMS;
-		off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
+		const off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
 		BeesAddress addr(e, p);
 		// This extent should consist entirely of non-magic blocks
@@ -393,69 +465,68 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 		// Calculate the hash first because it lets us shortcut on is_data_zero
 		BEESNOTE("scan hash " << bbd);
-		BeesHash hash = bbd.hash();
+		const BeesHash hash = bbd.hash();
 		// Weed out zero blocks
 		BEESNOTE("is_data_zero " << bbd);
 		const bool data_is_zero = bbd.is_data_zero();
 		if (data_is_zero) {
 			bar.at(bar_p) = '0';
 			zero_set.insert(p);
 			BEESCOUNT(scan_zero);
 			continue;
 		}
 		// Schedule this block for insertion if we decide to keep this extent.
 		BEESCOUNT(scan_hash_preinsert);
 		BEESTRACE("Pushing hash " << hash << " addr " << addr << " bbd " << bbd);
 		insert_map.insert(make_pair(p, make_pair(hash, addr)));
-		bar.at(bar_p) = 'R';
+		bar.at(bar_p) = 'i';
-		// Weed out zero blocks
+		// Ensure we fill in the entire insert_map without skipping any non-zero blocks
-		BEESNOTE("is_data_zero " << bbd);
+		if (p < next_p) continue;
 		bool extent_is_zero = bbd.is_data_zero();
 		if (extent_is_zero) {
 			bar.at(bar_p) = '0';
 			if (extent_compressed) {
 				if (!extent_contains_zero) {
 					// BEESLOG("compressed zero bbd " << bbd << "\n\tin extent " << e);
 				}
 				extent_contains_zero = true;
 				// Do not attempt to lookup hash of zero block
 				continue;
 			} else {
 				BEESLOGINFO("zero bbd " << bbd << "\n\tin extent " << e);
 				BEESCOUNT(scan_zero_uncompressed);
 				rewrite_extent = true;
 				break;
 			}
 		} else {
 			if (extent_contains_zero && !extent_contains_nonzero) {
 				// BEESLOG("compressed nonzero bbd " << bbd << "\n\tin extent " << e);
 			}
 			extent_contains_nonzero = true;
 		}
 		BEESNOTE("lookup hash " << bbd);
-		auto found = hash_table->find_cell(hash);
+		const auto found = hash_table->find_cell(hash);
 		BEESCOUNT(scan_lookup);
 		set<BeesResolver> resolved_addrs;
 		set<BeesAddress> found_addrs;
 		list<BeesAddress> ordered_addrs;
-		// We know that there is at least one copy of the data and where it is,
+		for (const auto &i : found) {
 		// but we don't want to do expensive LOGICAL_INO operations unless there
 		// are at least two distinct addresses to look at.
 		found_addrs.insert(addr);
 		for (auto i : found) {
 			BEESTRACE("found (hash, address): " << i);
 			BEESCOUNT(scan_found);
 			// Hash has to match
 			THROW_CHECK2(runtime_error, i.e_hash, hash, i.e_hash == hash);
 			// We know that there is at least one copy of the data and where it is.
 			// Filter out anything that can't possibly match before we pull out the
 			// LOGICAL_INO hammer.
 			BeesAddress found_addr(i.e_addr);
 #if 0
 			// If address already in hash table, move on to next extent.
-			// We've already seen this block and may have made additional references to it.
+			// Only extents that are scanned but not modified are inserted, so if there's
-			// The current extent is effectively "pinned" and can't be modified any more.
+			// a matching hash:address pair in the hash table:
 			// 1.  We have already scanned this extent.
 			// 2.  We may have already created references to this extent.
 			// 3.  We won't scan this extent again.
 			// The current extent is effectively "pinned" and can't be modified
 			// without rescanning all the existing references.
 			if (found_addr.get_physical_or_zero() == addr.get_physical_or_zero()) {
 				// No log message because this happens to many thousands of blocks
 				// when bees is interrupted.
 				// BEESLOGDEBUG("Found matching hash " << hash << " at same address " << addr << ", skipping " << bfr);
 				BEESCOUNT(scan_already);
-				return bfr;
+				return;
 			}
 			// Address is a duplicate.
 			// Check this early so we don't have duplicate counts.
 			if (!found_addrs.insert(found_addr).second) {
 				BEESCOUNT(scan_twice);
 				continue;
 			}
 #endif
 			// Block must have matching EOF alignment
 			if (found_addr.is_unaligned_eof() != addr.is_unaligned_eof()) {
@@ -463,214 +534,353 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 				continue;
 			}
 			// Address is a duplicate
 			if (!found_addrs.insert(found_addr).second) {
 				BEESCOUNT(scan_twice);
 				continue;
 			}
 			// Hash is toxic
 			if (found_addr.is_toxic()) {
-				BEESLOGWARN("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
+				BEESLOGDEBUG("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
 				// Don't push these back in because we'll never delete them.
 				// Extents may become non-toxic so give them a chance to expire.
 				// hash_table->push_front_hash_addr(hash, found_addr);
 				BEESCOUNT(scan_toxic_hash);
-				return bfr;
+				return;
 			}
-			// Distinct address, go resolve it
+			// Put this address in the list without changing hash table order
-			bool abandon_extent = false;
+			ordered_addrs.push_back(found_addr);
-			catch_all([&]() {
+		}
 				BEESNOTE("resolving " << found_addr << " matched " << bbd);
 				BEESTRACE("resolving " << found_addr << " matched " << bbd);
 				BEESTRACE("BeesContext::scan_one_extent calling BeesResolver " << found_addr);
 				BeesResolver resolved(m_ctx, found_addr);
 				// Toxic extents are really toxic
 				if (resolved.is_toxic()) {
 					BEESLOGWARN("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
 					BEESCOUNT(scan_toxic_match);
 					// Make sure we never see this hash again.
 					// It has become toxic since it was inserted into the hash table.
 					found_addr.set_toxic();
 					hash_table->push_front_hash_addr(hash, found_addr);
 					abandon_extent = true;
 				} else if (!resolved.count()) {
 					BEESCOUNT(scan_resolve_zero);
 					// Didn't find anything, address is dead
 					BEESTRACE("matched hash " << hash << " addr " << addr << " count zero");
 					hash_table->erase_hash_addr(hash, found_addr);
 				} else {
 					resolved_addrs.insert(resolved);
 					BEESCOUNT(scan_resolve_hit);
 				}
 			});
-			if (abandon_extent) {
+		// Cheap filtering is now out of the way, now for some heavy lifting
-				return bfr;
+		for (auto found_addr : ordered_addrs) {
 			// Hash table says there's a matching block on the filesystem.
 			// Go find refs to it.
 			BEESNOTE("resolving " << found_addr << " matched " << bbd);
 			BEESTRACE("resolving " << found_addr << " matched " << bbd);
 			BEESTRACE("BeesContext::scan_one_extent calling BeesResolver " << found_addr);
 			BeesResolver resolved(m_ctx, found_addr);
 			// Toxic extents are really toxic
 			if (resolved.is_toxic()) {
 				BEESLOGDEBUG("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
 				BEESCOUNT(scan_toxic_match);
 				// Make sure we never see this hash again.
 				// It has become toxic since it was inserted into the hash table.
 				found_addr.set_toxic();
 				hash_table->push_front_hash_addr(hash, found_addr);
 				return;
 			} else if (!resolved.count()) {
 				BEESCOUNT(scan_resolve_zero);
 				// Didn't find a block at the table address, address is dead
 				BEESLOGDEBUG("Erasing stale addr " << addr << " hash " << hash);
 				hash_table->erase_hash_addr(hash, found_addr);
 				continue;
 			} else {
 				BEESCOUNT(scan_resolve_hit);
 			}
 		}
-		// This shouldn't happen (often), so let's count it separately
+			// `resolved` contains references to a block on the filesystem that still exists.
 		if (resolved_addrs.size() > 2) {
 			BEESCOUNT(matched_3_or_more);
 		}
 		if (resolved_addrs.size() > 1) {
 			BEESCOUNT(matched_2_or_more);
 		}
 		// No need to do all this unless there are two or more distinct matches
 		if (!resolved_addrs.empty()) {
 			bar.at(bar_p) = 'M';
 			BEESCOUNT(matched_1_or_more);
 			BEESTRACE("resolved_addrs.size() = " << resolved_addrs.size());
 			BEESNOTE("resolving " << resolved_addrs.size() << " matches for hash " << hash);
-			BeesFileRange replaced_bfr;
+			BEESNOTE("finding one match (out of " << resolved.count() << ") at " << resolved.addr() << " for " << bbd);
 			BEESTRACE("finding one match (out of " << resolved.count() << ") at " << resolved.addr() << " for " << bbd);
 			auto replaced_brp = resolved.replace_dst(bbd);
 			BeesFileRange &replaced_bfr = replaced_brp.second;
 			BEESTRACE("next_p " << to_hex(next_p) << " -> replaced_bfr " << replaced_bfr);
-			BeesAddress last_replaced_addr;
+			// If we did find a block, but not this hash, correct the hash table and move on
-			for (auto it = resolved_addrs.begin(); it != resolved_addrs.end(); ++it) {
+			if (resolved.found_hash()) {
-				// FIXME:  Need to terminate this loop on replace_dst exception condition
+				BEESCOUNT(scan_hash_hit);
-				// catch_all([&]() {
+			} else {
-					auto it_copy = *it;
+				BEESLOGDEBUG("Erasing stale hash " << hash << " addr " << resolved.addr());
-					BEESNOTE("finding one match (out of " << it_copy.count() << ") at " << it_copy.addr() << " for " << bbd);
+				hash_table->erase_hash_addr(hash, resolved.addr());
-					BEESTRACE("finding one match (out of " << it_copy.count() << ") at " << it_copy.addr() << " for " << bbd);
+				BEESCOUNT(scan_hash_miss);
-					replaced_bfr = it_copy.replace_dst(bbd);
+				continue;
 					BEESTRACE("next_p " << to_hex(next_p) << " -> replaced_bfr " << replaced_bfr);
 					// If we didn't find this hash where the hash table said it would be,
 					// correct the hash table.
 					if (it_copy.found_hash()) {
 						BEESCOUNT(scan_hash_hit);
 					} else {
 						// BEESLOGDEBUG("erase src hash " << hash << " addr " << it_copy.addr());
 						BEESCOUNT(scan_hash_miss);
 						hash_table->erase_hash_addr(hash, it_copy.addr());
 					}
 					if (it_copy.found_dup()) {
 						BEESCOUNT(scan_dup_hit);
 						// FIXME:  we will thrash if we let multiple references to identical blocks
 						// exist in the hash table.  Erase all but the last one.
 						if (last_replaced_addr) {
 							BEESLOGINFO("Erasing redundant hash " << hash << " addr " << last_replaced_addr);
 							hash_table->erase_hash_addr(hash, last_replaced_addr);
 							BEESCOUNT(scan_erase_redundant);
 						}
 						last_replaced_addr = it_copy.addr();
 						// Invalidate resolve cache so we can count refs correctly
 						m_ctx->invalidate_addr(it_copy.addr());
 						m_ctx->invalidate_addr(bbd.addr());
 						// Remove deduped blocks from insert map
 						THROW_CHECK0(runtime_error, replaced_bfr);
 						for (off_t ip = replaced_bfr.begin(); ip < replaced_bfr.end(); ip += BLOCK_SIZE_SUMS) {
 							BEESCOUNT(scan_dup_block);
 							noinsert_set.insert(ip);
 							if (ip >= e.begin() && ip < e.end()) {
 								off_t bar_p = (ip - e.begin()) / BLOCK_SIZE_SUMS;
 								bar.at(bar_p) = 'd';
 							}
 						}
 						// next_p may be past EOF so check p only
 						THROW_CHECK2(runtime_error, p, replaced_bfr, p < replaced_bfr.end());
 						BEESCOUNT(scan_bump);
 						next_p = replaced_bfr.end();
 					} else {
 						BEESCOUNT(scan_dup_miss);
 					}
 				// });
 			}
-			if (last_replaced_addr) {
+
-				// If we replaced extents containing the incoming addr,
+			// We found a block and it was a duplicate
-				// push the addr we kept to the front of the hash LRU.
+			if (resolved.found_dup()) {
-				hash_table->push_front_hash_addr(hash, last_replaced_addr);
+				THROW_CHECK0(runtime_error, replaced_bfr);
-				BEESCOUNT(scan_push_front);
+				BEESCOUNT(scan_dup_hit);
 				// Save this match.  If a better match is found later,
 				// it will be replaced.
 				dedupe_list.push_back(replaced_brp);
 				// Push matching block to front of LRU
 				front_hash_list.push_back(make_pair(hash, resolved.addr()));
 				// This is the block that matched in the replaced bfr
 				bar.at(bar_p) = '=';
 				// Invalidate resolve cache so we can count refs correctly
 				invalidate_addr_list.push_back(resolved.addr());
 				invalidate_addr_list.push_back(bbd.addr());
 				// next_p may be past EOF so check p only
 				THROW_CHECK2(runtime_error, p, replaced_bfr, p < replaced_bfr.end());
 				// We may find duplicate ranges of various lengths, so make sure
 				// we don't pick a smaller one
 				next_p = max(next_p, replaced_bfr.end());
 				// Stop after one dedupe is found.  If there's a longer matching range
 				// out there, we'll find a matching block after the end of this range,
 				// since the longer range is longer than this one.
 				break;
 			} else {
 				BEESCOUNT(scan_dup_miss);
 			}
 		} else {
 			BEESCOUNT(matched_0);
 		}
 	}
-	// If the extent was compressed and all zeros, nuke entire thing
+	bool force_insert = false;
-	if (!rewrite_extent && (extent_contains_zero && !extent_contains_nonzero)) {
+
-		rewrite_extent = true;
+	// We don't want to punch holes into compressed extents, unless:
-		BEESCOUNT(scan_zero_compressed);
+	// 1.  There was dedupe of non-zero blocks, so we always have to copy the rest of the extent
 	// 2.  The entire extent is zero and the whole thing can be replaced with a single hole
 	const bool extent_compressed = e.flags() & FIEMAP_EXTENT_ENCODED;
 	if (extent_compressed && dedupe_list.empty() && !insert_map.empty()) {
 		// BEESLOGDEBUG("Compressed extent with non-zero data and no dedupe, skipping");
 		BEESCOUNT(scan_compressed_no_dedup);
 		force_insert = true;
 	}
-	// If we deduped any blocks then we must rewrite the remainder of the extent
+	// FIXME:  dedupe_list contains a lot of overlapping matches.  Get rid of all but one.
-	if (!noinsert_set.empty()) {
+	list<BeesRangePair> dedupe_list_out;
-		rewrite_extent = true;
+	dedupe_list.sort([](const BeesRangePair &a, const BeesRangePair &b) {
 		return b.second.size() < a.second.size();
 	});
 	// Shorten each dedupe brp by removing any overlap with earlier (longer) extents in list
 	for (auto i : dedupe_list) {
 		bool insert_i = true;
 		BEESTRACE("i = " << i << " insert_i " << insert_i);
 		for (const auto &j : dedupe_list_out) {
 			BEESTRACE("j = " << j);
 			// No overlap, try next one
 			if (j.second.end() <= i.second.begin() || j.second.begin() >= i.second.end()) {
 				continue;
 			}
 			// j fully overlaps or is the same as i, drop i
 			if (j.second.begin() <= i.second.begin() && j.second.end() >= i.second.end()) {
 				insert_i = false;
 				break;
 			}
 			// i begins outside j, i ends inside j, remove the end of i
 			if (i.second.end() > j.second.begin() && i.second.begin() <= j.second.begin()) {
 				const auto delta = i.second.end() - j.second.begin();
 				if (delta == i.second.size()) {
 					insert_i = false;
 					break;
 				}
 				i.shrink_end(delta);
 				continue;
 			}
 			// i begins inside j, ends outside j, remove the begin of i
 			if (i.second.begin() < j.second.end() && i.second.end() >= j.second.end()) {
 				const auto delta = j.second.end() - i.second.begin();
 				if (delta == i.second.size()) {
 					insert_i = false;
 					break;
 				}
 				i.shrink_begin(delta);
 				continue;
 			}
 			// i fully overlaps j, split i into two parts, push the other part onto dedupe_list
 			if (j.second.begin() > i.second.begin() && j.second.end() < i.second.end()) {
 				auto other_i = i;
 				const auto end_left_delta = i.second.end() - j.second.begin();
 				const auto begin_right_delta = i.second.begin() - j.second.end();
 				i.shrink_end(end_left_delta);
 				other_i.shrink_begin(begin_right_delta);
 				dedupe_list.push_back(other_i);
 				continue;
 			}
 			// None of the above.  Oops!
 			THROW_CHECK0(runtime_error, false);
 		}
 		if (insert_i) {
 			dedupe_list_out.push_back(i);
 		}
 	}
 	dedupe_list = dedupe_list_out;
 	dedupe_list_out.clear();
 	// Count total dedupes
 	uint64_t bytes_deduped = 0;
 	for (const auto &i : dedupe_list) {
 		// Remove deduped blocks from insert map and zero map
 		for (off_t ip = i.second.begin(); ip < i.second.end(); ip += BLOCK_SIZE_SUMS) {
 			BEESCOUNT(scan_dup_block);
 			dedupe_set.insert(ip);
 			zero_set.erase(ip);
 		}
 		bytes_deduped += i.second.size();
 	}
-	// If we need to replace part of the extent, rewrite all instances of it
+	// Copy all blocks of the extent that were not deduped or zero, but don't copy an entire extent
-	if (rewrite_extent) {
+	uint64_t bytes_zeroed = 0;
-		bool blocks_rewritten = false;
+	if (!force_insert) {
 		BEESTRACE("Rewriting extent " << e);
 		off_t last_p = e.begin();
 		off_t p = last_p;
-		off_t next_p;
+		off_t next_p = last_p;
 		BEESTRACE("next_p " << to_hex(next_p) << " p " << to_hex(p) << " last_p " << to_hex(last_p));
 		for (next_p = e.begin(); next_p < e.end(); ) {
 			p = next_p;
-			next_p += BLOCK_SIZE_SUMS;
+			next_p = min(next_p + BLOCK_SIZE_SUMS, e.end());
-			// BEESLOG("noinsert_set.count(" << to_hex(p) << ") " << noinsert_set.count(p));
+			// Can't be both dedupe and zero
-			if (noinsert_set.count(p)) {
+			THROW_CHECK2(runtime_error, zero_set.count(p), dedupe_set.count(p), zero_set.count(p) + dedupe_set.count(p) < 2);
 			if (zero_set.count(p)) {
 				bytes_zeroed += next_p - p;
 			}
 			// BEESLOG("dedupe_set.count(" << to_hex(p) << ") " << dedupe_set.count(p));
 			if (dedupe_set.count(p)) {
 				if (p - last_p > 0) {
-					rewrite_file_range(BeesFileRange(bfr.fd(), last_p, p));
+					THROW_CHECK2(runtime_error, p, e.end(), p <= e.end());
-					blocks_rewritten = true;
+					copy_list.push_back(BeesFileRange(bfr.fd(), last_p, p));
 				}
 				last_p = next_p;
 			} else {
 				off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
 				bar.at(bar_p) = '+';
 			}
 		}
 		BEESTRACE("last");
-		if (next_p - last_p > 0) {
+		if (next_p > last_p) {
-			rewrite_file_range(BeesFileRange(bfr.fd(), last_p, next_p));
+			THROW_CHECK2(runtime_error, next_p, e.end(), next_p <= e.end());
-			blocks_rewritten = true;
+			copy_list.push_back(BeesFileRange(bfr.fd(), last_p, next_p));
 		}
-		if (blocks_rewritten) {
-			// Nothing left to insert, all blocks clobbered
-			insert_map.clear();
-		} else {
-			// BEESLOG("No blocks rewritten");
-			BEESCOUNT(scan_no_rewrite);
-		}
 	}
-	// We did not rewrite the extent and it contained data, so insert it.
+	// Don't copy an entire extent
-	for (auto i : insert_map) {
+	if (!bytes_zeroed && copy_list.size() == 1 && copy_list.begin()->size() == e.size()) {
-		off_t bar_p = (i.first - e.begin()) / BLOCK_SIZE_SUMS;
+		copy_list.clear();
-		BEESTRACE("e " << e << "bar_p = " << bar_p << " i.first-e.begin() " << i.first - e.begin() << " i.second " << i.second.first << ", " << i.second.second);
+	}
-		if (noinsert_set.count(i.first)) {
+
-			// FIXME:  we removed one reference to this copy.  Avoid thrashing?
+	// Count total copies
-			hash_table->erase_hash_addr(i.second.first, i.second.second);
+	uint64_t bytes_copied = 0;
-			// Block was clobbered, do not insert
+	for (const auto &i : copy_list) {
-			// Will look like 'Ddddd' because we skip deduped blocks
+		bytes_copied += i.size();
-			bar.at(bar_p) = 'D';
+	}
-			BEESCOUNT(inserted_clobbered);
+
 	BEESTRACE("bar: " << bar);
 	// Don't do nuisance dedupes part 1:  free more blocks than we create
 	THROW_CHECK3(runtime_error, bytes_copied, bytes_zeroed, bytes_deduped, bytes_copied >= bytes_zeroed);
 	const auto cost_copy = bytes_copied - bytes_zeroed;
 	const auto gain_dedupe = bytes_deduped + bytes_zeroed;
 	if (cost_copy > gain_dedupe) {
 		BEESLOGDEBUG("Too many bytes copied (" << pretty(bytes_copied) << ") for bytes deduped (" << pretty(bytes_deduped) << ") and holes punched (" << pretty(bytes_zeroed) << "), skipping extent");
 		BEESCOUNT(scan_skip_bytes);
 		force_insert = true;
 	}
 	// Don't do nuisance dedupes part 2:  nobody needs more than 100 dedupe/copy ops in one extent
 	if (dedupe_list.size() + copy_list.size() > 100) {
 		BEESLOGDEBUG("Too many dedupe (" << dedupe_list.size() << ") and copy (" << copy_list.size() << ") operations, skipping extent");
 		BEESCOUNT(scan_skip_ops);
 		force_insert = true;
 	}
 	// Track whether we rewrote anything
 	bool extent_modified = false;
 	// If we didn't delete the dedupe list, do the dedupes now
 	for (const auto &i : dedupe_list) {
 		BEESNOTE("dedup " << i);
 		if (force_insert || m_ctx->dedup(i)) {
 			BEESCOUNT(replacedst_dedup_hit);
 			THROW_CHECK0(runtime_error, i.second);
 			for (off_t ip = i.second.begin(); ip < i.second.end(); ip += BLOCK_SIZE_SUMS) {
 				if (ip >= e.begin() && ip < e.end()) {
 					off_t bar_p = (ip - e.begin()) / BLOCK_SIZE_SUMS;
 					if (bar.at(bar_p) != '=') {
 						if (ip == i.second.begin()) {
 							bar.at(bar_p) = '<';
 						} else if (ip + BLOCK_SIZE_SUMS >= i.second.end()) {
 							bar.at(bar_p) = '>';
 						} else {
 							bar.at(bar_p) = 'd';
 						}
 					}
 				}
 			}
 			extent_modified = !force_insert;
 		} else {
 			BEESLOGINFO("dedup failed: " << i);
 			BEESCOUNT(replacedst_dedup_miss);
 			// User data changed while we were looking up the extent, or we have a bug.
 			// We can't fix this, but we can immediately stop wasting effort.
 			return;
 		}
 	}
 	// Then the copy/rewrites
 	for (const auto &i : copy_list) {
 		if (!force_insert) {
 			rewrite_file_range(i);
 			extent_modified = true;
 		}
 		for (auto p = i.begin(); p < i.end(); p += BLOCK_SIZE_SUMS) {
 			off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
 			// Leave zeros as-is because they aren't really copies
 			if (bar.at(bar_p) != '0') {
 				bar.at(bar_p) = '+';
 			}
 		}
 	}
 	if (!force_insert) {
 		// Push matched hashes to front
 		for (const auto &i : front_hash_list) {
 			hash_table->push_front_hash_addr(i.first, i.second);
 			BEESCOUNT(scan_push_front);
 		}
 		// Invalidate cached resolves
 		for (const auto &i : invalidate_addr_list) {
 			m_ctx->invalidate_addr(i);
 		}
 	}
 	// Don't insert hashes pointing to an extent we just deleted
 	if (!extent_modified) {
 		// We did not rewrite the extent and it contained data, so insert it.
 		// BEESLOGDEBUG("Inserting " << insert_map.size() << " hashes from " << bfr);
 		for (const auto &i : insert_map) {
 			hash_table->push_random_hash_addr(i.second.first, i.second.second);
-			bar.at(bar_p) = '.';
+			off_t bar_p = (i.first - e.begin()) / BLOCK_SIZE_SUMS;
-			BEESCOUNT(inserted_block);
+			if (bar.at(bar_p) == 'i') {
 				bar.at(bar_p) = '.';
 			}
 			BEESCOUNT(scan_hash_insert);
 		}
 	}
 	// Visualize
 	if (bar != string(block_count, '.')) {
-		BEESLOGINFO("scan: " << pretty(e.size()) << " " << to_hex(e.begin()) << " [" << bar << "] " << to_hex(e.end()) << ' ' << name_fd(bfr.fd()));
+		BEESLOGINFO(
 			(force_insert ? "skip" : "scan") << ": "
 			<< pretty(e.size()) << " "
 			<< dedupe_list.size() << "d" << copy_list.size() << "c"
 			<< ((bytes_zeroed + BLOCK_SIZE_SUMS - 1) / BLOCK_SIZE_SUMS) << "p"
 			<< (extent_compressed ? "z " : " ")
 			<< one_timer << "s {"
 			<< to_hex(e.bytenr()) << "+" << to_hex(e.offset()) << "} "
 			<< to_hex(e.begin()) << " [" << bar << "] " << to_hex(e.end())
 			<< ' ' << name_fd(bfr.fd())
 		);
 	}
-	// Costs 10% on benchmarks
+	// Put this extent into the recently seen list if we didn't rewrite it,
 	// and remove it if we did.
 	lock_seen.lock();
 	if (extent_modified) {
 		s_seen.erase(tup);
 		BEESCOUNT(scan_seen_erase);
 	} else {
 		// BEESLOGDEBUG("Seen " << tup << " " << e);
 		s_seen.insert(tup);
 		BEESCOUNT(scan_seen_insert);
 	}
 	lock_seen.unlock();
 	// Now causes 75% loss of performance in benchmarks
 	// bees_unreadahead(bfr.fd(), bfr.begin(), bfr.size());
 	return bfr;
 }
 shared_ptr<Exclusion>
@@ -703,14 +913,14 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
 	// No FD?  Well, that was quick.
 	if (!bfr.fd()) {
 		// BEESLOGINFO("No FD in " << root_path() << " for " << bfr);
-		BEESCOUNT(scan_no_fd);
+		BEESCOUNT(scanf_no_fd);
 		return false;
 	}
 	// Sanity check
 	if (bfr.begin() >= bfr.file_size()) {
-		BEESLOGWARN("past EOF: " << bfr);
+		BEESLOGDEBUG("past EOF: " << bfr);
-		BEESCOUNT(scan_eof);
+		BEESCOUNT(scanf_eof);
 		return false;
 	}
@@ -730,9 +940,11 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
 					// BEESLOGDEBUG("Deferring extent bytenr " << to_hex(extent_bytenr) << " from " << bfr);
 					BEESCOUNT(scanf_deferred_extent);
 					start_over = true;
 					return; // from closure
 				}
 				Timer one_extent_timer;
 				scan_one_extent(bfr, e);
 				// BEESLOGDEBUG("Scanned " << e << " " << bfr);
 				BEESCOUNTADD(scanf_extent_ms, one_extent_timer.age() * 1000);
 				BEESCOUNT(scanf_extent);
 			});
@@ -784,9 +996,10 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 	Timer resolve_timer;
 	struct rusage usage_before;
 	struct rusage usage_after;
 	{
 		BEESNOTE("waiting to resolve addr " << addr << " with LOGICAL_INO");
-		const auto lock = MultiLocker::get_lock("logical_ino");
+		auto lock = MultiLocker::get_lock("logical_ino");
 		// Get this thread's system CPU usage
 		DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_before));
@@ -800,13 +1013,13 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 		} else {
 			BEESCOUNT(resolve_fail);
 		}
-		BEESCOUNTADD(resolve_ms, resolve_timer.age() * 1000);
+		DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_after));
 		const auto resolve_timer_age = resolve_timer.age();
 		BEESCOUNTADD(resolve_ms, resolve_timer_age * 1000);
 		lock.reset();
 		bees_throttle(resolve_timer_age, "resolve_addr");
 	}
-	// Again!
-	struct rusage usage_after;
-	DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_after));
 	const double sys_usage_delta =
 		(usage_after.ru_stime.tv_sec + usage_after.ru_stime.tv_usec / 1000000.0) -
 		(usage_before.ru_stime.tv_sec + usage_before.ru_stime.tv_usec / 1000000.0);
@@ -913,19 +1126,20 @@ BeesContext::start()
 	m_progress_thread = make_shared<BeesThread>("progress_report");
 	m_status_thread = make_shared<BeesThread>("status_report");
-	m_progress_thread->exec([=]() {
+	m_progress_thread->exec([this]() {
 		show_progress();
 	});
-	m_status_thread->exec([=]() {
+	m_status_thread->exec([this]() {
 		dump_status();
 	});
 	// Set up temporary file pool
-	m_tmpfile_pool.generator([=]() -> shared_ptr<BeesTempFile> {
+	m_tmpfile_pool.generator([this]() -> shared_ptr<BeesTempFile> {
 		return make_shared<BeesTempFile>(shared_from_this());
 	});
 	m_logical_ino_pool.generator([]() {
-		return make_shared<BtrfsIoctlLogicalInoArgs>(0);
+		const auto extent_ref_size = sizeof(uint64_t) * 3;
 		return make_shared<BtrfsIoctlLogicalInoArgs>(0, BEES_MAX_EXTENT_REF_COUNT * extent_ref_size + sizeof(btrfs_data_container));
 	});
 	m_tmpfile_pool.checkin([](const shared_ptr<BeesTempFile> &btf) {
 		catch_all([&](){
@@ -356,6 +356,8 @@ BeesHashTable::prefetch_loop()
 		auto avg_rates = thisStats / m_ctx->total_timer().age();
 		graph_blob << "\t" << avg_rates << "\n";
 		graph_blob << m_ctx->get_progress();
 		BEESLOGINFO(graph_blob.str());
 		catch_all([&]() {
 			m_stats_file.write(graph_blob.str());
@@ -446,10 +448,38 @@ BeesHashTable::fetch_missing_extent_by_index(uint64_t extent_index)
 		// If we are in prefetch, give the kernel a hint about the next extent
 		if (m_prefetch_running) {
-			// XXX: don't call this if bees_readahead is implemented by pread()
+			// Use the kernel readahead here, because it might work for this use case
-			bees_readahead(m_fd, dirty_extent_offset + dirty_extent_size, dirty_extent_size);
+			readahead(m_fd, dirty_extent_offset + dirty_extent_size, dirty_extent_size);
 		}
 	});
 	Cell *cell     = m_extent_ptr[extent_index    ].p_buckets[0].p_cells;
 	Cell *cell_end = m_extent_ptr[extent_index + 1].p_buckets[0].p_cells;
 	size_t toxic_cleared_count = 0;
 	set<BeesHashTable::Cell> seen_it(cell, cell_end);
 	while (cell < cell_end) {
 		if (cell->e_addr & BeesAddress::c_toxic_mask) {
 			++toxic_cleared_count;
 			cell->e_addr &= ~BeesAddress::c_toxic_mask;
 			// Clearing the toxic bit might mean we now have a duplicate.
 			// This could be due to a race between two
 			// inserts, one finds the extent toxic while the
 			// other does not.  That's arguably a bug elsewhere,
 			// but we should rewrite the whole extent lookup/insert
 			// loop, not spend time fixing code that will be
 			// thrown out later anyway.
 			// If there is a cell that is identical to this one
 			// except for the toxic bit, then we don't need this one.
 			if (seen_it.count(*cell)) {
 				cell->e_addr = 0;
 				cell->e_hash = 0;
 			}
 		}
 		++cell;
 	}
 	if (toxic_cleared_count) {
 		BEESLOGDEBUG("Cleared " << toxic_cleared_count << " hashes while fetching hash table extent " << extent_index);
 	}
 }
 void
@@ -767,7 +797,7 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
 	for (auto fp = madv_flags; fp->value; ++fp) {
 		BEESTOOLONG("madvise(" << fp->name << ")");
 		if (madvise(m_byte_ptr, m_size, fp->value)) {
-			BEESLOGWARN("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
+			BEESLOGNOTICE("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
 		}
 	}
@@ -781,8 +811,19 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
 		prefetch_loop();
        });
-	// Blacklist might fail if the hash table is not stored on a btrfs
+	// Blacklist might fail if the hash table is not stored on a btrfs,
 	// or if it's on a _different_ btrfs
 	catch_all([&]() {
 		// Root is definitely a btrfs
 		BtrfsIoctlFsInfoArgs root_info;
 		root_info.do_ioctl(m_ctx->root_fd());
 		// Hash might not be a btrfs
 		BtrfsIoctlFsInfoArgs hash_info;
 		// If btrfs fs_info ioctl fails, it must be a different fs
 		if (!hash_info.do_ioctl_nothrow(m_fd)) return;
 		// If Hash is a btrfs, Root must be the same one
 		if (root_info.fsid() != hash_info.fsid()) return;
 		// Hash is on the same one, blacklist it
 		m_ctx->blacklist_insert(BeesFileId(m_fd));
 	});
 }
@@ -384,7 +384,7 @@ BeesResolver::for_each_extent_ref(BeesBlockData bbd, function<bool(const BeesFil
 	return stop_now;
 }
-BeesFileRange
+BeesRangePair
 BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
 {
 	BEESTRACE("replace_dst dst_bfr " << dst_bfr_in);
@@ -400,6 +400,7 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
 	BEESTRACE("overlap_bfr " << overlap_bfr);
 	BeesBlockData bbd(dst_bfr);
 	BeesRangePair rv = { BeesFileRange(), BeesFileRange() };
 	for_each_extent_ref(bbd, [&](const BeesFileRange &src_bfr_in) -> bool {
 		// Open src
@@ -436,21 +437,12 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
 			BEESCOUNT(replacedst_grown);
 		}
-		// Dedup
+		rv = brp;
-		BEESNOTE("dedup " << brp);
+		m_found_dup = true;
-		if (m_ctx->dedup(brp)) {
+		return true;
 			BEESCOUNT(replacedst_dedup_hit);
 			m_found_dup = true;
 			overlap_bfr = brp.second;
 			// FIXME:  find best range first, then dedupe that
 			return true; // i.e. break
 		} else {
 			BEESCOUNT(replacedst_dedup_miss);
 			return false; // i.e. continue
 		}
 	});
 	// BEESLOG("overlap_bfr after " << overlap_bfr);
-	return overlap_bfr.copy_closed();
+	return rv;
 }
 BeesFileRange
@@ -14,7 +14,7 @@ BeesThread::exec(function<void()> func)
 {
 	m_timer.reset();
 	BEESLOGDEBUG("BeesThread exec " << m_name);
-	m_thread_ptr = make_shared<thread>([=]() {
+	m_thread_ptr = make_shared<thread>([this, func]() {
 		BeesNote::set_name(m_name);
 		BEESLOGDEBUG("Starting thread " << m_name);
 		BEESNOTE("thread function");
@@ -8,38 +8,32 @@ thread_local BeesTracer *BeesTracer::tl_next_tracer = nullptr;
 thread_local bool BeesTracer::tl_first = true;
 thread_local bool BeesTracer::tl_silent = false;
-bool
-exception_check()
-{
 #if __cplusplus >= 201703
+static
+bool
+exception_check()
+{
 	return uncaught_exceptions();
+}
 #else
+static
+bool
+exception_check()
+{
 	return uncaught_exception();
+}
 #endif
-}
 BeesTracer::~BeesTracer()
 {
 	if (!tl_silent && exception_check()) {
 		if (tl_first) {
-			BEESLOGNOTICE("--- BEGIN TRACE --- exception ---");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE --- exception ---");
 			tl_first = false;
 		}
 		try {
 			m_func();
 		} catch (exception &e) {
-			BEESLOGNOTICE("Nested exception: " << e.what());
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception: " << e.what());
 		} catch (...) {
-			BEESLOGNOTICE("Nested exception ...");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception ...");
 		}
 		if (!m_next_tracer) {
-			BEESLOGNOTICE("---  END  TRACE --- exception ---");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: ---  END  TRACE --- exception ---");
 		}
 	}
 	tl_next_tracer = m_next_tracer;
@@ -49,7 +43,7 @@ BeesTracer::~BeesTracer()
 	}
 }
-BeesTracer::BeesTracer(function<void()> f, bool silent) :
+BeesTracer::BeesTracer(const function<void()> &f, bool silent) :
 	m_func(f)
 {
 	m_next_tracer = tl_next_tracer;
@@ -61,12 +55,12 @@ void
 BeesTracer::trace_now()
 {
 	BeesTracer *tp = tl_next_tracer;
-	BEESLOGNOTICE("--- BEGIN TRACE ---");
+	BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE ---");
 	while (tp) {
 		tp->m_func();
 		tp = tp->m_next_tracer;
 	}
-	BEESLOGNOTICE("---  END  TRACE ---");
+	BEESLOG(BEES_TRACE_LEVEL, "TRACE: ---  END  TRACE ---");
 }
 bool
@@ -91,9 +85,9 @@ BeesNote::~BeesNote()
 	tl_next = m_prev;
 	unique_lock<mutex> lock(s_mutex);
 	if (tl_next) {
-		s_status[crucible::gettid()] = tl_next;
+		s_status[gettid()] = tl_next;
 	} else {
-		s_status.erase(crucible::gettid());
+		s_status.erase(gettid());
 	}
 }
@@ -104,7 +98,7 @@ BeesNote::BeesNote(function<void(ostream &os)> f) :
 	m_prev = tl_next;
 	tl_next = this;
 	unique_lock<mutex> lock(s_mutex);
-	s_status[crucible::gettid()] = tl_next;
+	s_status[gettid()] = tl_next;
 }
 void
@@ -183,6 +183,24 @@ BeesFileRange::grow_begin(off_t delta)
 	return m_begin;
 }
 off_t
 BeesFileRange::shrink_begin(off_t delta)
 {
 	THROW_CHECK1(invalid_argument, delta, delta > 0);
 	THROW_CHECK3(invalid_argument, delta, m_begin, m_end, delta + m_begin < m_end);
 	m_begin += delta;
 	return m_begin;
 }
 off_t
 BeesFileRange::shrink_end(off_t delta)
 {
 	THROW_CHECK1(invalid_argument, delta, delta > 0);
 	THROW_CHECK2(invalid_argument, delta, m_end, m_end >= delta);
 	m_end -= delta;
 	return m_end;
 }
 BeesFileRange::BeesFileRange(const BeesBlockData &bbd) :
 	m_fd(bbd.fd()),
 	m_begin(bbd.begin()),
@@ -349,8 +367,8 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 	BEESTRACE("e_second " << e_second);
 	// Preread entire extent
-	bees_readahead(second.fd(), e_second.begin(), e_second.size());
+	bees_readahead_pair(second.fd(), e_second.begin(), e_second.size(),
-	bees_readahead(first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());
+			    first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());
 	auto hash_table = ctx->hash_table();
@@ -388,17 +406,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			break;
 		}
 		// Source extent cannot be toxic
 		BeesAddress first_addr(first.fd(), new_first.begin());
 		if (!first_addr.is_magic()) {
 			auto first_resolved = ctx->resolve_addr(first_addr);
 			if (first_resolved.is_toxic()) {
 				BEESLOGWARN("WORKAROUND: not growing matching pair backward because src addr is toxic:\n" << *this);
 				BEESCOUNT(pairbackward_toxic_addr);
 				break;
 			}
 		}
 		// Extend second range.  If we hit BOF we can go no further.
 		BeesFileRange new_second = second;
 		BEESTRACE("new_second = " << new_second);
@@ -434,6 +441,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 		}
 		// Source block cannot be zero in a non-compressed non-magic extent
 		BeesAddress first_addr(first.fd(), new_first.begin());
 		if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
 			BEESCOUNT(pairbackward_zero);
 			break;
@@ -449,7 +457,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
 			BEESCOUNT(pairbackward_toxic_hash);
 			break;
 		}
@@ -491,17 +499,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			break;
 		}
 		// Source extent cannot be toxic
 		BeesAddress first_addr(first.fd(), new_first.begin());
 		if (!first_addr.is_magic()) {
 			auto first_resolved = ctx->resolve_addr(first_addr);
 			if (first_resolved.is_toxic()) {
 				BEESLOGWARN("WORKAROUND: not growing matching pair forward because src is toxic:\n" << *this);
 				BEESCOUNT(pairforward_toxic);
 				break;
 			}
 		}
 		// Extend second range.  If we hit EOF we can go no further.
 		BeesFileRange new_second = second;
 		BEESTRACE("new_second = " << new_second);
@@ -545,6 +542,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 		}
 		// Source block cannot be zero in a non-compressed non-magic extent
 		BeesAddress first_addr(first.fd(), new_first.begin());
 		if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
 			BEESCOUNT(pairforward_zero);
 			break;
@@ -560,7 +558,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
 			BEESCOUNT(pairforward_toxic_hash);
 			break;
 		}
@@ -574,7 +572,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 	}
 	if (first.overlaps(second)) {
-		BEESLOGTRACE("after grow, first " << first << "\n\toverlaps " << second);
+		BEESLOGDEBUG("after grow, first " << first << "\n\toverlaps " << second);
 		BEESCOUNT(bug_grow_pair_overlaps);
 	}
@@ -589,6 +587,22 @@ BeesRangePair::copy_closed() const
 	return BeesRangePair(first.copy_closed(), second.copy_closed());
 }
 void
 BeesRangePair::shrink_begin(off_t const delta)
 {
 	first.shrink_begin(delta);
 	second.shrink_begin(delta);
 	THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
 }
 void
 BeesRangePair::shrink_end(off_t const delta)
 {
 	first.shrink_end(delta);
 	second.shrink_end(delta);
 	THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
 }
 ostream &
 operator<<(ostream &os, const BeesAddress &ba)
 {
@@ -660,7 +674,7 @@ BeesAddress::magic_check(uint64_t flags)
 	static const unsigned recognized_flags = compressed_flags | delalloc_flags | ignore_flags | unusable_flags;
 	if (flags & ~recognized_flags) {
-		BEESLOGTRACE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
+		BEESLOGNOTICE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
 		m_addr = UNUSABLE;
 		// maybe we throw here?
 		BEESCOUNT(addr_unrecognized);
@@ -12,9 +12,10 @@ Load management options:
    -C, --thread-factor   Worker thread factor (default 1)
    -G, --thread-min      Minimum worker thread count (default 0)
    -g, --loadavg-target  Target load average for worker threads (default none)
        --throttle-factor Idle time between operations (default 1.0)
 Filesystem tree traversal options:
-    -m, --scan-mode       Scanning mode (0..2, default 0)
+    -m, --scan-mode       Scanning mode (0..4, default 4)
 Workarounds:
    -a, --workaround-btrfs-send    Workaround for btrfs send
@@ -4,6 +4,7 @@
 #include "crucible/process.h"
 #include "crucible/string.h"
 #include "crucible/task.h"
 #include "crucible/uname.h"
 #include <cctype>
 #include <cmath>
@@ -11,17 +12,19 @@
 #include <iostream>
 #include <memory>
 #include <regex>
 #include <sstream>
 // PRIx64
 #include <inttypes.h>
 #include <sched.h>
 #include <sys/fanotify.h>
 #include <linux/fs.h>
 #include <sys/ioctl.h>
 // statfs
 #include <linux/magic.h>
 #include <sys/statfs.h>
 // setrlimit
 #include <sys/time.h>
 #include <sys/resource.h>
@@ -198,7 +201,7 @@ BeesTooLong::check() const
 	if (age() > m_limit) {
 		ostringstream oss;
 		m_func(oss);
-		BEESLOGWARN("PERFORMANCE: " << *this << " sec: " << oss.str());
+		BEESLOGINFO("PERFORMANCE: " << *this << " sec: " << oss.str());
 	}
 }
@@ -214,23 +217,45 @@ BeesTooLong::operator=(const func_type &f)
 	return *this;
 }
-void
+static
-bees_readahead(int const fd, const off_t offset, const size_t size)
+bool
 bees_readahead_check(int const fd, off_t const offset, size_t const size)
 {
 	// FIXME: the rest of the code calls this function more often than necessary,
 	// usually back-to-back calls on the same range in a loop.
 	// Simply discard requests that are identical to recent requests.
 	const Stat stat_rv(fd);
 	auto tup = make_tuple(offset, size, stat_rv.st_dev, stat_rv.st_ino);
 	static mutex s_recent_mutex;
 	static set<decltype(tup)> s_recent;
 	static Timer s_recent_timer;
 	unique_lock<mutex> lock(s_recent_mutex);
 	if (s_recent_timer.age() > 5.0) {
 		s_recent_timer.reset();
 		s_recent.clear();
 		BEESCOUNT(readahead_clear);
 	}
 	const auto rv = s_recent.insert(tup);
 	// If we recently did this readahead, we're done here
 	if (!rv.second) {
 		BEESCOUNT(readahead_skip);
 	}
 	return rv.second;
 }
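The time-based flush in `bees_readahead_check` can be sketched in isolation as follows. This is a minimal illustration, not the code from the commit: `RecentRequests` and its `check` method are hypothetical names, and the clock is passed in as a plain number of seconds (the hunk above uses a crucible `Timer` instead) so the behaviour can be exercised without sleeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <set>
#include <tuple>

// Hypothetical stand-in for the static s_recent set in bees_readahead_check.
// Requests are remembered until the whole set is older than max_age_seconds,
// then the set is discarded in one step (time-based flush, no size limit).
class RecentRequests {
public:
	// (offset, size, st_dev, st_ino), as in the tuple built in the hunk above
	using Key = std::tuple<std::int64_t, std::size_t, std::uint64_t, std::uint64_t>;

	explicit RecentRequests(double max_age_seconds) : m_max_age(max_age_seconds) {
		assert(max_age_seconds > 0);
	}

	// Returns true if the caller should perform the request, false if an
	// identical request has already been seen since the last flush.
	bool check(const Key &key, double now_seconds) {
		std::lock_guard<std::mutex> lock(m_mutex);
		if (now_seconds - m_last_flush > m_max_age) {
			m_last_flush = now_seconds;
			m_recent.clear();
		}
		return m_recent.insert(key).second;
	}

private:
	std::mutex m_mutex;
	std::set<Key> m_recent;
	double m_last_flush = 0.0;
	double m_max_age;
};
```

With a 5-second limit, a repeated request is skipped until the first call that arrives more than 5 seconds after the previous flush; that call clears the set and is allowed through again.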
 static
 void
 bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
 {
 	if (!bees_readahead_check(fd, offset, size)) return;
 	Timer readahead_timer;
 	BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 	BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 #if 0
 	// In the kernel, readahead() is identical to posix_fadvise(..., POSIX_FADV_WILLNEED)
 	DIE_IF_NON_ZERO(readahead(fd, offset, size));
 #else
 	// Make sure this data is in page cache by brute force
-	// This isn't necessary and it might even be slower,
+	// The btrfs kernel code does readahead with lower ioprio
-	// but the btrfs kernel code does readahead with lower ioprio
+	// and might discard the readahead request entirely.
-	// and might discard the readahead request entirely,
-	// so it's maybe, *maybe*, worth doing both.
 	BEESNOTE("emulating readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
-	auto working_size = size;
+	auto working_size = min(size, uint64_t(128 * 1024 * 1024));
 	auto working_offset = offset;
 	while (working_size) {
 		// don't care about multithreaded writes to this buffer--it is garbage anyway
@@ -239,16 +264,41 @@ bees_readahead(int const fd, const off_t offset, const size_t size)
 		// Ignore errors and short reads.  It turns out our size
 		// parameter isn't all that accurate, so we can't use
 		// the pread_or_die template.
-		(void)!pread(fd, dummy, this_read_size, working_offset);
+		const auto pr_rv = pread(fd, dummy, this_read_size, working_offset);
-		BEESCOUNT(readahead_count);
+		if (pr_rv >= 0) {
-		BEESCOUNTADD(readahead_bytes, this_read_size);
+			BEESCOUNT(readahead_count);
 			BEESCOUNTADD(readahead_bytes, pr_rv);
 		} else {
 			BEESCOUNT(readahead_fail);
 		}
 		working_offset += this_read_size;
 		working_size -= this_read_size;
 	}
 #endif
 	BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
 }
 static mutex s_only_one;
 void
 bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offset2, size_t size2)
 {
 	if (!bees_readahead_check(fd, offset, size) && !bees_readahead_check(fd2, offset2, size2)) return;
 	BEESNOTE("waiting to readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size) << ","
 		<< "\n\t" << name_fd(fd2) << " offset " << to_hex(offset2) << " len " << pretty(size2));
 	unique_lock<mutex> m_lock(s_only_one);
 	bees_readahead_nolock(fd, offset, size);
 	bees_readahead_nolock(fd2, offset2, size2);
 }
 void
 bees_readahead(int const fd, const off_t offset, const size_t size)
 {
 	if (!bees_readahead_check(fd, offset, size)) return;
 	BEESNOTE("waiting to readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 	unique_lock<mutex> m_lock(s_only_one);
 	bees_readahead_nolock(fd, offset, size);
 }
 void
 bees_unreadahead(int const fd, off_t offset, size_t size)
 {
@@ -259,6 +309,48 @@ bees_unreadahead(int const fd, off_t offset, size_t size)
 	BEESCOUNTADD(readahead_unread_ms, unreadahead_timer.age() * 1000);
 }
 static double bees_throttle_factor = 0.0;
 void
 bees_throttle(const double time_used, const char *const context)
 {
 	static mutex s_mutex;
 	unique_lock<mutex> throttle_lock(s_mutex);
 	struct time_pair {
 		double time_used = 0;
 		double time_count = 0;
 		double longest_sleep_time = 0;
 	};
 	static map<string, time_pair> s_time_map;
 	auto &this_time = s_time_map[context];
 	auto &this_time_used = this_time.time_used;
 	auto &this_time_count = this_time.time_count;
 	auto &longest_sleep_time = this_time.longest_sleep_time;
 	this_time_used += time_used;
 	++this_time_count;
 	// Keep the timing data fresh
 	static Timer s_fresh_timer;
 	if (s_fresh_timer.age() > 60) {
 		s_fresh_timer.reset();
 		this_time_count *= 0.9;
 		this_time_used *= 0.9;
 	}
 	// Wait for enough data to calculate rates
 	if (this_time_used < 1.0 || this_time_count < 1.0) return;
 	const auto avg_time = this_time_used / this_time_count;
 	const auto sleep_time = min(60.0, bees_throttle_factor * avg_time - time_used);
 	if (sleep_time <= 0) {
 		return;
 	}
 	if (sleep_time > longest_sleep_time) {
 		BEESLOGDEBUG(context << ": throttle delay " << sleep_time << " s, time used " << time_used << " s, avg time " << avg_time << " s");
 		longest_sleep_time = sleep_time;
 	}
 	throttle_lock.unlock();
 	BEESNOTE(context << ": throttle delay " << sleep_time << " s, time used " << time_used << " s, avg time " << avg_time << " s");
 	nanosleep(sleep_time);
 }
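The delay arithmetic in `bees_throttle` can be read as a pure function of the running totals. The sketch below is illustrative only: `throttle_delay` is a hypothetical name, it assumes `total_time` and `count` already include the current sample (as they do in the hunk above), and it leaves out the per-context map, the decay of old samples, and the actual sleep.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical pure-function version of the delay computed in bees_throttle:
// sleep long enough that this operation's total time is about
// factor * (average time per operation), capped at 60 seconds.
double throttle_delay(double factor, double total_time, double count, double time_used)
{
	assert(factor >= 0);
	// Not enough data yet to estimate an average
	if (total_time < 1.0 || count < 1.0) return 0.0;
	const double avg_time = total_time / count;
	const double sleep_time = std::min(60.0, factor * avg_time - time_used);
	// A non-positive result means no delay at all
	return std::max(0.0, sleep_time);
}
```

For example, with a factor of 2.0, an average of 1.0 s per operation, and 0.5 s used by the current operation, the delay is 1.5 s. An operation that already took longer than `factor * avg_time` is not delayed.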
 thread_local random_device bees_random_device;
 thread_local uniform_int_distribution<default_random_engine::result_type> bees_random_seed_dist(
 	numeric_limits<default_random_engine::result_type>::min(),
@@ -304,6 +396,73 @@ BeesStringFile::read()
 	return read_string(fd, st.st_size);
 }
 static
 void
 bees_fsync(int const fd)
 {
 	// Note that when btrfs renames a temporary over an existing file,
 	// it flushes the temporary, so we get the right behavior if we
 	// just do nothing here (except when the file is first created;
 	// however, in that case the result is the same as if the file
 	// did not exist, was empty, or was filled with garbage).
 	//
 	// Kernel versions prior to 5.16 had bugs which would put ghost
 	// dirents in $BEESHOME if there was a crash when we called
 	// fsync() here.
 	//
 	// Some other filesystems will throw our data away if we don't
 	// call fsync, so we do need to call fsync() on those filesystems.
 	//
 	// Newer btrfs kernel versions rely on fsync() to report
 	// unrecoverable write errors.  If we don't check the fsync()
 	// result, we'll lose the data when we rename().  Kernel 6.2 added
 	// a number of new root causes for the class of "unrecoverable
 	// write errors" so we need to check this now.
 	BEESNOTE("checking filesystem type for " << name_fd(fd));
 	// LSB deprecated statfs without providing a replacement that
 	// can fill in the f_type field.
 	struct statfs stf = { 0 };
 	DIE_IF_NON_ZERO(fstatfs(fd, &stf));
 	if (static_cast<decltype(BTRFS_SUPER_MAGIC)>(stf.f_type) != BTRFS_SUPER_MAGIC) {
 		BEESLOGONCE("Using fsync on non-btrfs filesystem type " << to_hex(stf.f_type));
 		BEESNOTE("fsync non-btrfs " << name_fd(fd));
 		DIE_IF_NON_ZERO(fsync(fd));
 		return;
 	}
 	static bool did_uname = false;
 	static bool do_fsync = false;
 	if (!did_uname) {
 		Uname uname;
 		const string version(uname.release);
 		static const regex version_re(R"/(^(\d+)\.(\d+)\.)/", regex::optimize | regex::ECMAScript);
 		smatch m;
 		// Last known bug in the fsync-rename use case was fixed in kernel 5.16
 		static const auto min_major = 5, min_minor = 16;
 		if (regex_search(version, m, version_re)) {
 			const auto major = stoul(m[1]);
 			const auto minor = stoul(m[2]);
 			if (tie(major, minor) > tie(min_major, min_minor)) {
 				BEESLOGONCE("Using fsync on btrfs because kernel version is " << major << "." << minor);
 				do_fsync = true;
 			} else {
 				BEESLOGONCE("Not using fsync on btrfs because kernel version is " << major << "." << minor);
 			}
 		} else {
 			BEESLOGONCE("Not using fsync on btrfs because can't parse kernel version '" << version << "'");
 		}
 		did_uname = true;
 	}
 	if (do_fsync) {
 		BEESNOTE("fsync btrfs " << name_fd(fd));
 		DIE_IF_NON_ZERO(fsync(fd));
 	}
 }
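The version gate in `bees_fsync` combines a regex match on the `uname` release string with a lexicographic tuple comparison. A standalone sketch of that pattern, assuming a hypothetical helper name `kernel_newer_than` and using only the standard library in place of crucible's `Uname`:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <tuple>

// Hypothetical helper: returns true if `release` starts with "major.minor."
// and (major, minor) is strictly greater than (min_major, min_minor).
// The strict `>` mirrors the comparison in the hunk above, so a release
// equal to the minimum itself returns false. Unparseable strings also
// return false.
bool kernel_newer_than(const std::string &release, unsigned long min_major, unsigned long min_minor)
{
	static const std::regex version_re(R"(^(\d+)\.(\d+)\.)");
	std::smatch m;
	if (!std::regex_search(release, m, version_re)) return false;
	const unsigned long major = std::stoul(m[1].str());
	const unsigned long minor = std::stoul(m[2].str());
	// std::tie builds tuples of references; operator> compares them
	// element by element, major first, then minor.
	return std::tie(major, minor) > std::tie(min_major, min_minor);
}
```

Note that with a minimum of 5.16, the release string `5.16.0` does not pass the strict comparison while `5.17.3` does. Whether the boundary version should be included is a decision for the real code, not something this sketch settles.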
 void
 BeesStringFile::write(string contents)
 {
@@ -319,19 +478,8 @@ BeesStringFile::write(string contents)
 		Fd ofd = openat_or_die(m_dir_fd, tmpname, FLAGS_CREATE_FILE, S_IRUSR | S_IWUSR);
 		BEESNOTE("writing " << tmpname << " in " << name_fd(m_dir_fd));
 		write_or_die(ofd, contents);
-#if 0
-		// This triggers too many btrfs bugs.  I wish I was kidding.
-		// Forget snapshots, balance, compression, and dedupe:
-		// the system call you have to fear on btrfs is fsync().
-		// Also note that when bees renames a temporary over an
-		// existing file, it flushes the temporary, so we get
-		// the right behavior if we just do nothing here
-		// (except when the file is first created; however,
-		// in that case the result is the same as if the file
-		// did not exist, was empty, or was filled with garbage).
 		BEESNOTE("fsyncing " << tmpname << " in " << name_fd(m_dir_fd));
-		DIE_IF_NON_ZERO(fsync(ofd));
+		bees_fsync(ofd);
-#endif
 	}
 	BEESNOTE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
 	BEESTRACE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
@@ -355,6 +503,25 @@ BeesTempFile::resize(off_t offset)
 	// Count time spent here
 	BEESCOUNTADD(tmp_resize_ms, resize_timer.age() * 1000);
 	// Modify flags - every time
 	// - btrfs will keep trying to set FS_NOCOMP_FL behind us when compression heuristics identify
 	//   the data as compressible, but it fails to compress
 	// - clear FS_NOCOW_FL because we can only dedupe between files with the same FS_NOCOW_FL state,
 	//   and we don't open FS_NOCOW_FL files for dedupe.
 	BEESTRACE("Getting FS_COMPR_FL and FS_NOCOMP_FL on m_fd " << name_fd(m_fd));
 	int flags = ioctl_iflags_get(m_fd);
 	const auto orig_flags = flags;
 	flags |= FS_COMPR_FL;
 	flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
 	if (flags != orig_flags) {
 		BEESTRACE("Setting FS_COMPR_FL and clearing FS_NOCOMP_FL | FS_NOCOW_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
 		ioctl_iflags_set(m_fd, flags);
 	}
 	// That may have queued some delayed ref deletes, so throttle them
 	bees_throttle(resize_timer.age(), "tmpfile_resize");
 }
 void
@@ -395,13 +562,6 @@ BeesTempFile::BeesTempFile(shared_ptr<BeesContext> ctx) :
 	// Add this file to open_root_ino lookup table
 	m_roots->insert_tmpfile(m_fd);
 	// Set compression attribute
 	BEESTRACE("Getting FS_COMPR_FL on m_fd " << name_fd(m_fd));
 	int flags = ioctl_iflags_get(m_fd);
 	flags |= FS_COMPR_FL;
 	BEESTRACE("Setting FS_COMPR_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
 	ioctl_iflags_set(m_fd, flags);
 	// Count time spent here
 	BEESCOUNTADD(tmp_create_ms, create_timer.age() * 1000);
@@ -490,6 +650,8 @@ BeesTempFile::make_copy(const BeesFileRange &src)
 	}
 	BEESCOUNTADD(tmp_copy_ms, copy_timer.age() * 1000);
 	bees_throttle(copy_timer.age(), "tmpfile_copy");
 	BEESCOUNT(tmp_copy);
 	return rv;
 }
@@ -528,19 +690,23 @@ operator<<(ostream &os, const siginfo_t &si)
 static sigset_t new_sigset, old_sigset;
 static
 void
-block_term_signal()
+block_signals()
 {
 	BEESLOGDEBUG("Masking signals");
 	DIE_IF_NON_ZERO(sigemptyset(&new_sigset));
 	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGTERM));
 	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGINT));
 	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGUSR1));
 	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGUSR2));
 	DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &new_sigset, &old_sigset));
 }
 static
 void
-wait_for_term_signal()
+wait_for_signals()
 {
 	BEESNOTE("waiting for signals");
 	BEESLOGDEBUG("Waiting for signals...");
@@ -557,14 +723,28 @@ wait_for_term_signal()
 			THROW_ERRNO("sigwaitinfo errno = " << errno);
 		} else {
 			BEESLOGNOTICE("Received signal " << rv << " info " << info);
-			// Unblock so we die immediately if signalled again
+			// If SIGTERM or SIGINT, unblock so we die immediately if signalled again
-			DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &old_sigset, &new_sigset));
+			switch (info.si_signo) {
-			break;
+				case SIGUSR1:
 					BEESLOGNOTICE("Received SIGUSR1 - pausing workers");
 					TaskMaster::pause(true);
 					break;
 				case SIGUSR2:
 					BEESLOGNOTICE("Received SIGUSR2 - unpausing workers");
 					TaskMaster::pause(false);
 					break;
 				case SIGTERM:
 				case SIGINT:
 				default:
 					DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &old_sigset, &new_sigset));
 					BEESLOGDEBUG("Signal catcher exiting");
 					return;
 			}
 		}
 	}
 	BEESLOGDEBUG("Signal catcher exiting");
 }
 static
 int
 bees_main(int argc, char *argv[])
 {
@@ -573,7 +753,7 @@ bees_main(int argc, char *argv[])
 			BEESLOGDEBUG("exception (ignored): " << s);
 			BEESCOUNT(exception_caught_silent);
 		} else {
-			BEESLOGNOTICE("\n\n*** EXCEPTION ***\n\t" << s << "\n***\n");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: EXCEPTION: " << s);
 			BEESCOUNT(exception_caught);
 		}
 	});
@@ -588,47 +768,51 @@ bees_main(int argc, char *argv[])
 	// Have to block signals now before we create a bunch of threads
 	// so the threads will also have the signals blocked.
-	block_term_signal();
+	block_signals();
 	// Create a context so we can apply configuration to it
 	shared_ptr<BeesContext> bc = make_shared<BeesContext>();
 	BEESLOGDEBUG("context constructed");
 	string cwd(readlink_or_die("/proc/self/cwd"));
 	// Defaults
 	bool use_relative_paths = false;
 	bool chatter_prefix_timestamp = true;
 	double thread_factor = 0;
 	unsigned thread_count = 0;
 	unsigned thread_min = 0;
 	double load_target = 0;
 	bool workaround_btrfs_send = false;
-	BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_INDEPENDENT;
+	BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_EXTENT;
 	// Configure getopt_long
 	// Options with no short form
 	enum {
 		BEES_OPT_THROTTLE_FACTOR = 256,
 	};
 	static const struct option long_options[] = {
-		{ "thread-factor",         required_argument, NULL, 'C' },
+		{ .name = "thread-factor",         .has_arg = required_argument, .val = 'C' },
-		{ "thread-min",            required_argument, NULL, 'G' },
+		{ .name = "throttle-factor",       .has_arg = required_argument, .val = BEES_OPT_THROTTLE_FACTOR },
-		{ "strip-paths",           no_argument,       NULL, 'P' },
+		{ .name = "thread-min",            .has_arg = required_argument, .val = 'G' },
-		{ "no-timestamps",         no_argument,       NULL, 'T' },
+		{ .name = "strip-paths",           .has_arg = no_argument,       .val = 'P' },
-		{ "workaround-btrfs-send", no_argument,       NULL, 'a' },
+		{ .name = "no-timestamps",         .has_arg = no_argument,       .val = 'T' },
-		{ "thread-count",          required_argument, NULL, 'c' },
+		{ .name = "workaround-btrfs-send", .has_arg = no_argument,       .val = 'a' },
-		{ "loadavg-target",        required_argument, NULL, 'g' },
+		{ .name = "thread-count",          .has_arg = required_argument, .val = 'c' },
-		{ "help",                  no_argument,       NULL, 'h' },
+		{ .name = "loadavg-target",        .has_arg = required_argument, .val = 'g' },
-		{ "scan-mode",             required_argument, NULL, 'm' },
+		{ .name = "help",                  .has_arg = no_argument,       .val = 'h' },
-		{ "absolute-paths",        no_argument,       NULL, 'p' },
+		{ .name = "scan-mode",             .has_arg = required_argument, .val = 'm' },
-		{ "timestamps",            no_argument,       NULL, 't' },
+		{ .name = "absolute-paths",        .has_arg = no_argument,       .val = 'p' },
-		{ "verbose",               required_argument, NULL, 'v' },
+		{ .name = "timestamps",            .has_arg = no_argument,       .val = 't' },
-		{ 0, 0, 0, 0 },
+		{ .name = "verbose",               .has_arg = required_argument, .val = 'v' },
+		{ 0 },
 	};
 	// Build getopt_long's short option list from the long_options table.
 	// While we're at it, make sure we didn't duplicate any options.
 	string getopt_list;
-	set<decltype(option::val)> option_vals;
+	map<decltype(option::val), string> option_vals;
 	for (const struct option *op = long_options; op->val; ++op) {
-		THROW_CHECK1(runtime_error, op->val, !option_vals.count(op->val));
+		const auto ins_rv = option_vals.insert(make_pair(op->val, op->name));
-		option_vals.insert(op->val);
+		THROW_CHECK1(runtime_error, op->val, ins_rv.second);
 		if ((op->val & 0xff) != op->val) {
 			continue;
 		}
@@ -639,27 +823,31 @@ bees_main(int argc, char *argv[])
 	}
 	// Parse options
 	int c;
 	while (true) {
 		int option_index = 0;
-		c = getopt_long(argc, argv, getopt_list.c_str(), long_options, &option_index);
+		const auto c = getopt_long(argc, argv, getopt_list.c_str(), long_options, &option_index);
 		if (-1 == c) {
 			break;
 		}
-		BEESLOGDEBUG("Parsing option '" << static_cast<char>(c) << "'");
+		// getopt_long should have weeded out any invalid options,
+		// so we can go ahead and throw here
+		BEESLOGDEBUG("Parsing option '" << option_vals.at(c) << "'");
 		switch (c) {
 			case 'C':
 				thread_factor = stod(optarg);
 				break;
 			case BEES_OPT_THROTTLE_FACTOR:
 				bees_throttle_factor = stod(optarg);
 				break;
 			case 'G':
 				thread_min = stoul(optarg);
 				break;
 			case 'P':
-				crucible::set_relative_path(cwd);
+				use_relative_paths = true;
 				break;
 			case 'T':
 				chatter_prefix_timestamp = false;
@@ -677,7 +865,7 @@ bees_main(int argc, char *argv[])
 				root_scan_mode = static_cast<BeesRoots::ScanMode>(stoul(optarg));
 				break;
 			case 'p':
-				crucible::set_relative_path("");
+				use_relative_paths = false;
 				break;
 			case 't':
 				chatter_prefix_timestamp = true;
@@ -695,12 +883,12 @@ bees_main(int argc, char *argv[])
 			case 'h':
 			default:
 				do_cmd_help(argv);
-				return EXIT_FAILURE;
+				return EXIT_SUCCESS;
 		}
 	}
 	if (optind + 1 != argc) {
-		BEESLOGERR("Only one filesystem path per bees process");
+		BEESLOGERR("Exactly one filesystem path required");
 		return EXIT_FAILURE;
 	}
@@ -740,22 +928,32 @@ bees_main(int argc, char *argv[])
 	BEESLOGNOTICE("setting worker thread pool maximum size to " << thread_count);
 	TaskMaster::set_thread_count(thread_count);
 	BEESLOGNOTICE("setting throttle factor to " << bees_throttle_factor);
 	// Set root path
 	string root_path = argv[optind++];
 	BEESLOGNOTICE("setting root path to '" << root_path << "'");
 	bc->set_root_path(root_path);
 	// Set path prefix
 	if (use_relative_paths) {
 		crucible::set_relative_path(name_fd(bc->root_fd()));
 	}
 	// Workaround for btrfs send
 	bc->roots()->set_workaround_btrfs_send(workaround_btrfs_send);
 	// Set root scan mode
 	bc->roots()->set_scan_mode(root_scan_mode);
 	// Workaround for the logical-ino-vs-clone kernel bug
 	MultiLocker::enable_locking(true);
 	// Start crawlers
 	bc->start();
 	// Now we just wait forever
-	wait_for_term_signal();
+	wait_for_signals();
 	// Shut it down
 	bc->stop();
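Reviewer note on the option-table hunk above: the loop builds the short option string for `getopt_long` from the `long_options` table and uses the `map` insert result to detect duplicate `.val` entries. The sketch below shows that pattern in isolation. `build_short_opts` is a hypothetical helper name, not a function in bees. The part that appends `:` for `required_argument` is an assumption, because those lines are outside the hunks shown here.

```cpp
#include <cassert>
#include <getopt.h>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of bees): derive the short option string
// from a long_options table that ends with a zeroed entry, and reject
// tables where two entries share the same .val.
std::string build_short_opts(const struct option *opts)
{
	assert(opts != nullptr);
	std::string rv;
	std::map<int, std::string> seen;
	for (const struct option *op = opts; op->val; ++op) {
		// insert() reports whether the key was new in .second
		const auto ins = seen.insert(std::make_pair(op->val, std::string(op->name)));
		if (!ins.second) {
			throw std::runtime_error("duplicate option value: --" + ins.first->second + " and --" + op->name);
		}
		// Values that do not fit in one byte have no short form
		if ((op->val & 0xff) != op->val) {
			continue;
		}
		rv += static_cast<char>(op->val);
		// Assumption: options that take an argument get a trailing ':'
		if (op->has_arg == required_argument) {
			rv += ':';
		}
	}
	return rv;
}
```

Keeping the option name as the mapped value is what lets the debug message in the parse loop print a readable name for options such as `--throttle-factor`, whose value (256) is not a printable character.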
@@ -78,13 +78,13 @@ const int BEES_PROGRESS_INTERVAL = BEES_STATS_INTERVAL;
 const int BEES_STATUS_INTERVAL = 1;
 // Number of file FDs to cache when not in active use
-const size_t BEES_FILE_FD_CACHE_SIZE = 4096;
+const size_t BEES_FILE_FD_CACHE_SIZE = 524288;
 // Number of root FDs to cache when not in active use
-const size_t BEES_ROOT_FD_CACHE_SIZE = 1024;
+const size_t BEES_ROOT_FD_CACHE_SIZE = 65536;
 // Number of FDs to open (rlimit)
-const size_t BEES_OPEN_FILE_LIMIT = (BEES_FILE_FD_CACHE_SIZE + BEES_ROOT_FD_CACHE_SIZE) * 2 + 100;
+const size_t BEES_OPEN_FILE_LIMIT = BEES_FILE_FD_CACHE_SIZE + BEES_ROOT_FD_CACHE_SIZE + 100;
 // Worker thread factor (multiplied by detected number of CPU cores)
 const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
@@ -93,10 +93,11 @@ const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
 const double BEES_TOO_LONG = 5.0;
 // Avoid any extent where LOGICAL_INO takes this much kernel CPU time
-const double BEES_TOXIC_SYS_DURATION = 0.1;
+const double BEES_TOXIC_SYS_DURATION = 5.0;
-// Maximum number of refs to a single extent
+// Maximum number of refs to a single extent before we have other problems
-const size_t BEES_MAX_EXTENT_REF_COUNT = (16 * 1024 * 1024 / 24) - 1;
+// If we have more than 10K refs to an extent, adding another will save 0.01% space
+const size_t BEES_MAX_EXTENT_REF_COUNT = 9999; // (16 * 1024 * 1024 / 24);
 // How long between hash table histograms
 const double BEES_HASH_TABLE_ANALYZE_INTERVAL = BEES_STATS_INTERVAL;
@@ -121,9 +122,9 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 // macros ----------------------------------------
 #define BEESLOG(lv,x)   do { if (lv < bees_log_level) { Chatter __chatter(lv, BeesNote::get_name()); __chatter << x; } } while (0)
 #define BEESLOGTRACE(x) do { BEESLOG(LOG_DEBUG, x); BeesTracer::trace_now(); } while (0)
-#define BEESTRACE(x)   BeesTracer  SRSLY_WTF_C(beesTracer_,  __LINE__) ([&]()                 { BEESLOG(LOG_ERR, x);   })
+#define BEES_TRACE_LEVEL LOG_DEBUG
+#define BEESTRACE(x)   BeesTracer  SRSLY_WTF_C(beesTracer_,  __LINE__) ([&]()                 { BEESLOG(BEES_TRACE_LEVEL, "TRACE: " << x << " at " << __FILE__ << ":" << __LINE__);   })
 #define BEESTOOLONG(x) BeesTooLong SRSLY_WTF_C(beesTooLong_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
 #define BEESNOTE(x)    BeesNote    SRSLY_WTF_C(beesNote_,    __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
@@ -133,6 +134,14 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 #define BEESLOGINFO(x)   BEESLOG(LOG_INFO, x)
 #define BEESLOGDEBUG(x)  BEESLOG(LOG_DEBUG, x)
 #define BEESLOGONCE(__x) do { \
        static bool already_logged = false; \
        if (!already_logged) { \
                already_logged = true; \
                BEESLOGNOTICE(__x); \
        } \
 } while (false)
 #define BEESCOUNT(stat) do { \
 	BeesStats::s_global.add_count(#stat); \
 } while (0)
@@ -184,7 +193,7 @@ class BeesTracer {
 	thread_local static bool tl_silent;
 	thread_local static bool tl_first;
 public:
-	BeesTracer(function<void()> f, bool silent = false);
+	BeesTracer(const function<void()> &f, bool silent = false);
 	~BeesTracer();
 	static void trace_now();
 	static bool get_silent();
@@ -299,6 +308,11 @@ public:
 	off_t grow_begin(off_t delta);
 	/// @}
 	/// @{ Make range smaller
 	off_t shrink_end(off_t delta);
 	off_t shrink_begin(off_t delta);
 	/// @}
 friend ostream & operator<<(ostream &os, const BeesFileRange &bfr);
 };
@@ -515,7 +529,7 @@ class BeesCrawl {
 	bool fetch_extents();
 	void fetch_extents_harder();
-	bool next_transid();
+	bool restart_crawl_unlocked();
 	BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;
 public:
@@ -527,6 +541,9 @@ public:
 	BeesCrawlState get_state_end() const;
 	void set_state(const BeesCrawlState &bcs);
 	void deferred(bool def_setting);
 	bool deferred() const;
 	bool finished() const;
 	bool restart_crawl();
 };
 class BeesScanMode;
@@ -535,7 +552,8 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	shared_ptr<BeesContext>			m_ctx;
 	BeesStringFile				m_crawl_state_file;
-	map<uint64_t, shared_ptr<BeesCrawl>>	m_root_crawl_map;
+	using CrawlMap = map<uint64_t, shared_ptr<BeesCrawl>>;
+	CrawlMap				m_root_crawl_map;
 	mutex					m_mutex;
 	uint64_t				m_crawl_dirty = 0;
 	uint64_t				m_crawl_clean = 0;
@@ -554,17 +572,13 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	condition_variable			m_stop_condvar;
 	bool					m_stop_requested = false;
-	void insert_new_crawl();
+	CrawlMap insert_new_crawl();
-	void insert_root(const BeesCrawlState &bcs);
 	Fd open_root_nocache(uint64_t root);
 	Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
-	uint64_t transid_min();
-	uint64_t transid_max();
 	uint64_t transid_max_nocache();
 	void state_load();
 	ostream &state_to_stream(ostream &os);
 	void state_save();
 	bool crawl_roots();
 	string crawl_state_filename() const;
 	void crawl_state_set_dirty();
 	void crawl_state_erase(const BeesCrawlState &bcs);
@@ -572,13 +586,16 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	void writeback_thread();
 	uint64_t next_root(uint64_t root = 0);
 	void current_state_set(const BeesCrawlState &bcs);
 	RateEstimator& transid_re();
 	bool crawl_batch(shared_ptr<BeesCrawl> crawl);
 	void clear_caches();
+	shared_ptr<BeesCrawl> insert_root(const BeesCrawlState &bcs);
 	bool up_to_date(const BeesCrawlState &bcs);
 friend class BeesCrawl;
 friend class BeesFdCache;
 friend class BeesScanMode;
 friend class BeesScanModeSubvol;
 friend class BeesScanModeExtent;
 public:
 	BeesRoots(shared_ptr<BeesContext> ctx);
@@ -594,17 +611,22 @@ public:
 	Fd open_root_ino(const BeesFileId &bfi) { return open_root_ino(bfi.root(), bfi.ino()); }
 	bool is_root_ro(uint64_t root);
 	// TODO:  do extent-tree scans instead
 	enum ScanMode {
 		SCAN_MODE_LOCKSTEP,
 		SCAN_MODE_INDEPENDENT,
 		SCAN_MODE_SEQUENTIAL,
 		SCAN_MODE_RECENT,
 		SCAN_MODE_EXTENT,
 		SCAN_MODE_COUNT, // must be last
 	};
 	void set_scan_mode(ScanMode new_mode);
 	void set_workaround_btrfs_send(bool do_avoid);
+	uint64_t transid_min();
+	uint64_t transid_max();
 	void wait_for_transid(const uint64_t count);
 };
 struct BeesHash {
@@ -664,6 +686,8 @@ class BeesRangePair : public pair<BeesFileRange, BeesFileRange> {
 public:
 	BeesRangePair(const BeesFileRange &src, const BeesFileRange &dst);
 	bool grow(shared_ptr<BeesContext> ctx, bool constrained);
 	void shrink_begin(const off_t delta);
 	void shrink_end(const off_t delta);
 	BeesRangePair copy_closed() const;
 	bool operator<(const BeesRangePair &that) const;
 friend ostream & operator<<(ostream &os, const BeesRangePair &brp);
@@ -737,11 +761,14 @@ class BeesContext : public enable_shared_from_this<BeesContext> {
 	shared_ptr<BeesThread>				m_progress_thread;
 	shared_ptr<BeesThread>				m_status_thread;
 	mutex						m_progress_mtx;
 	string						m_progress_str;
 	void set_root_fd(Fd fd);
 	BeesResolveAddrResult resolve_addr_uncached(BeesAddress addr);
-	BeesFileRange scan_one_extent(const BeesFileRange &bfr, const Extent &e);
+	void scan_one_extent(const BeesFileRange &bfr, const Extent &e);
 	void rewrite_file_range(const BeesFileRange &bfr);
 public:
@@ -772,6 +799,8 @@ public:
 	void dump_status();
 	void show_progress();
 	void set_progress(const string &str);
 	string get_progress();
 	void start();
 	void stop();
@@ -834,7 +863,7 @@ public:
 	BeesFileRange find_one_match(BeesHash hash);
 	void replace_src(const BeesFileRange &src_bfr);
-	BeesFileRange replace_dst(const BeesFileRange &dst_bfr);
+	BeesRangePair replace_dst(const BeesFileRange &dst_bfr);
 	bool found_addr() const { return m_found_addr; }
 	bool found_data() const { return m_found_data; }
@@ -868,7 +897,10 @@ extern const char *BEES_VERSION;
 extern thread_local default_random_engine bees_generator;
 string pretty(double d);
 void bees_readahead(int fd, off_t offset, size_t size);
 void bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offset2, size_t size2);
 void bees_unreadahead(int fd, off_t offset, size_t size);
 void bees_throttle(double time_used, const char *context);
 string format_time(time_t t);
 bool exception_check();
 #endif
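Reviewer note on `BEESLOGONCE`: the macro relies on a function-local `static` flag, so each place the macro is expanded logs at most once per process. The sketch below reproduces that shape with stand-in names. `DEMO_LOGNOTICE`, `DEMO_LOGONCE`, `demo_log`, `poll_once_site` and `demo` are all hypothetical, and the stand-in logger writes to a string stream instead of going through `Chatter`.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for BEESLOGNOTICE: collects messages in a stream
std::ostringstream demo_log;
#define DEMO_LOGNOTICE(x) do { demo_log << x << '\n'; } while (0)

// Same shape as BEESLOGONCE: one static flag per expansion site
#define DEMO_LOGONCE(x) do { \
	static bool already_logged = false; \
	if (!already_logged) { \
		already_logged = true; \
		DEMO_LOGNOTICE(x); \
	} \
} while (false)

// A single call site; repeated calls share the one static flag
void poll_once_site(int i)
{
	DEMO_LOGONCE("first seen at iteration " << i);
}

// Calls the same site three times; only the first call logs
std::string demo()
{
	demo_log.str("");
	for (int i = 0; i < 3; ++i) {
		poll_once_site(i);
	}
	return demo_log.str();
}
```

Because the flag is a plain `bool`, two threads reaching the same site at the same time could both log. In this sketch that is a limitation of the pattern, not a statement about how bees uses the macro.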
@@ -8,6 +8,7 @@ PROGRAMS = \
 	process \
 	progress \
 	seeker \
 	table \
 	task \
 all: test
@@ -19,7 +19,9 @@ seeker_finder(const vector<uint64_t> &vec, uint64_t lower, uint64_t upper)
 	if (ub != s.end()) ++ub;
 	if (ub != s.end()) ++ub;
 	for (; ub != s.end(); ++ub) {
-		if (*ub > upper) break;
+		if (*ub > upper) {
+			break;
+		}
 	}
 	return set<uint64_t>(lb, ub);
 }
@@ -28,7 +30,7 @@ static bool test_fails = false;
 static
 void
-seeker_test(const vector<uint64_t> &vec, uint64_t const target)
+seeker_test(const vector<uint64_t> &vec, uint64_t const target, bool const always_out = false)
 {
 	cerr << "Find " << target << " in {";
 	for (auto i : vec) {
@@ -36,11 +38,13 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 	}
 	cerr << " } = ";
 	size_t loops = 0;
 	tl_seeker_debug_str = make_shared<ostringstream>();
 	bool local_test_fails = false;
 	bool excepted = catch_all([&]() {
-		auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
+		const auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
 			++loops;
 			return seeker_finder(vec, lower, upper);
-		});
+		}, uint64_t(32));
 		cerr << found;
 		uint64_t my_found = 0;
 		for (auto i : vec) {
@@ -52,13 +56,15 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 			cerr << " (correct)";
 		} else {
 			cerr << " (INCORRECT - right answer is " << my_found << ")";
-			test_fails = true;
+			local_test_fails = true;
 		}
 	});
 	cerr << " (" << loops << " loops)" << endl;
-	if (excepted) {
+	if (excepted || local_test_fails || always_out) {
-		test_fails = true;
+		cerr << dynamic_pointer_cast<ostringstream>(tl_seeker_debug_str)->str();
 	}
 	test_fails = test_fails || local_test_fails;
 	tl_seeker_debug_str.reset();
 }
 static
@@ -89,6 +95,39 @@ test_seeker()
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max());
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max() - 1);
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() - 1 }, numeric_limits<uint64_t>::max());
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 0);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 1);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 2);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 3);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 4);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 5);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 6);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 7);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 8);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 9);
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 1 );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 2 );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 3 );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 4 );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 5 );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 6 );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 7 );
 	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 8 );
 	// Pulled from a bees debug log
 	seeker_test(vector<uint64_t> {
 		6821962845,
 		6821962848,
 		6821963411,
 		6821963422,
 		6821963536,
 		6821963539,
 		6821963835, // <- appeared during the search, causing an exception
 		6821963841,
 		6822575316,
 	}, 6821971036, true);
 }
@@ -0,0 +1,63 @@
+#include "tests.h"
+#include "crucible/table.h"
+using namespace crucible;
+using namespace std;
+void
+print_table(const Table::Table& t)
+{
+	cerr << "BEGIN TABLE\n";
+	cerr << t;
+	cerr << "END TABLE\n";
+	cerr << endl;
+}
+void
+test_table()
+{
+	Table::Table t;
+	t.insert_row(Table::endpos, vector<Table::Content> {
+		Table::Text("Hello, World!"),
+		Table::Text("2"),
+		Table::Text("3"),
+		Table::Text("4"),
+	});
+	print_table(t);
+	t.insert_row(Table::endpos, vector<Table::Content> {
+		Table::Text("Greeting"),
+		Table::Text("two"),
+		Table::Text("three"),
+		Table::Text("four"),
+	});
+	print_table(t);
+	t.insert_row(Table::endpos, vector<Table::Content> {
+		Table::Fill('-'),
+		Table::Text("ii"),
+		Table::Text("iii"),
+		Table::Text("iv"),
+	});
+	print_table(t);
+	t.mid(" | ");
+	t.left("| ");
+	t.right(" |");
+	print_table(t);
+	t.insert_col(1, vector<Table::Content> {
+		Table::Text("1"),
+		Table::Text("one"),
+		Table::Text("i"),
+		Table::Text("I"),
+	});
+	print_table(t);
+	t.at(2, 1) = Table::Text("Two\nLines");
+	print_table(t);
+}
+int
+main(int, char**)
+{
+	RUN_A_TEST(test_table());
+	exit(EXIT_SUCCESS);
+}