fsync: fix signed comparison of stf.f_type

Build fails on 32-bit Slackware because GCC 11's `-Werror=sign-compare` is
stricter than necessary:

    cc -Wall -Wextra -Werror -O3 -I../include -D_FILE_OFFSET_BITS=64 -std=c99 -O2 -march=i586 -mtune=i686 -o bees-version.o -c bees-version.c
    bees.cc: In function 'void bees_fsync(int)':
    bees.cc:426:24: error: comparison of integer expressions of different signedness: '__fsword_t' {aka 'int'} and 'unsigned int' [-Werror=sign-compare]
      426 |         if (stf.f_type != BTRFS_SUPER_MAGIC) {
          |                        ^

To work around this, cast `stf.f_type` to the same type as
`BTRFS_SUPER_MAGIC`, so it has the same number of bits that we're looking
for in the magic value.

Fixes: https://github.com/Zygo/bees/issues/317
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
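
For illustration only, here is a minimal sketch of the kind of cast the message above describes. It is not the actual patch: `fake_statfs` and `kFakeMagic` are hypothetical stand-ins (for `struct statfs` and `BTRFS_SUPER_MAGIC`) so that the snippet compiles without Linux headers, and the signed 32-bit field mirrors the `__fsword_t` {aka `int`} reported in the error.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for struct statfs on a 32-bit target,
// where f_type (__fsword_t) is a signed int.
struct fake_statfs {
    std::int32_t f_type;
};

// Hypothetical stand-in for BTRFS_SUPER_MAGIC (same value, unsigned type).
constexpr std::uint32_t kFakeMagic = 0x9123683Eu;

// Returns true when f_type carries the magic value.
// Casting f_type to the constant's own type makes both operands unsigned,
// so the comparison no longer mixes signedness.
inline bool is_btrfs(const fake_statfs &stf) {
    using magic_t = std::remove_cv_t<decltype(kFakeMagic)>;
    return static_cast<magic_t>(stf.f_type) == kFakeMagic;
}
```

Deriving the cast target from the constant itself, rather than naming a fixed type, keeps the two operands in step if the constant's type ever changes.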
tempfile: don't need to update the inode if the flags don't change
2026-01-08 20:00:22 +01:00 · 2025-07-03 21:48:40 -04:00 · 2025-06-29 23:34:10 -04:00 · 2025-06-29 23:25:36 -04:00 · 2025-06-29 23:24:55 -04:00 · 2025-06-18 23:06:14 -04:00
50 changed files with 3775 additions and 1226 deletions
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------

-bees is a block-oriented userspace deduplication agent designed for large
-btrfs filesystems.  It is an offline dedupe combined with an incremental
-data scan capability to minimize time data spends on disk from write
-to dedupe.
+bees is a block-oriented userspace deduplication agent designed to scale
+up to large btrfs filesystems.  It is an offline dedupe combined with
+an incremental data scan capability to minimize time data spends on disk
+from write to dedupe.

 Strengths
 ---------

- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Daemon mode - incrementally dedupes new data as it appears
+ * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
- * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
+ * btrfs support - recovers more free space from btrfs than naive dedupers

 Weaknesses
 ----------

 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
- * First run may require temporary disk space for extent reorganization
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * [First run may increase metadata space usage if many snapshots exist](docs/gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------

 * [bees Gotchas](docs/gotchas.md)
- * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](docs/btrfs-other.md)
 * [What to do when something goes wrong](docs/wrong.md)

@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
@@ -1,31 +1,24 @@
-Recommended Kernel Version for bees
-===================================
+Recommended Linux Kernel Version for bees
+=========================================

-First, a warning that is not specific to bees:
+First, a warning about old Linux kernel versions:

-> **Kernel 5.1, 5.2, and 5.3 should not be used with btrfs due to a
-severe regression that can lead to fatal metadata corruption.**
-This issue is fixed in kernel 5.4.14 and later.
+> **Linux kernel version 5.1, 5.2, and 5.3 should not be used with btrfs
+due to a severe regression that can lead to fatal metadata corruption.**
+This issue is fixed in version 5.4.14 and later.

-**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
-6.0, or 6.1, with recent LTS and -stable updates.**  The latest released
-kernel as of this writing is 6.4.1.
+**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
+6.6, or 6.12 with recent LTS and -stable updates.**  The latest released
+kernel as of this writing is 6.12.9, and the earliest supported LTS
+kernel is 5.4.

-4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
-issues.  Older kernels will be slower (a little slower or a lot slower
-depending on which issues are triggered).  Not all fixes are backported.
-
-Obsolete non-LTS kernels have a variety of unfixed issues and should
-not be used with btrfs.  For details see the table below.
-
-bees requires btrfs kernel API version 4.2 or higher, and does not work
-at all on older kernels.
-
-Some bees features rely on kernel 4.15 to work, and these features will
-not be available on older kernels.  Currently, bees is still usable on
-older kernels with degraded performance or with options disabled, but
-support for older kernels may be removed.
+Some optional bees features use kernel APIs introduced in kernel 4.15
+(extent scan) and 5.6 (`openat2` support).  These bees features are not
+available on older kernels.  Support for older kernels may be removed
+in a future bees release.

+bees will not run at all on kernels before 4.2 due to lack of minimal
+API support.
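
The version thresholds in the added text (4.2 to run at all, 4.15 for extent scan, 5.6 for `openat2`) can be sketched as a small lookup. This is an illustrative sketch, not how bees detects features: `KernelVersion`, `KernelFeatures`, `at_least`, and `features_for` are hypothetical names, and a real program would more likely probe the APIs directly than compare version numbers.

```cpp
#include <tuple>

// Hypothetical helper type: a (major, minor) kernel version pair.
struct KernelVersion {
    int ver_major;
    int ver_minor;
};

// True when v is at or above the given version (lexicographic compare).
inline bool at_least(KernelVersion v, int want_major, int want_minor) {
    return std::tie(v.ver_major, v.ver_minor) >=
           std::tie(want_major, want_minor);
}

// Hypothetical summary of the thresholds stated in the text.
struct KernelFeatures {
    bool can_run;      // 4.2 or later: minimal API support
    bool extent_scan;  // 4.15 or later
    bool openat2;      // 5.6 or later
};

inline KernelFeatures features_for(KernelVersion v) {
    return KernelFeatures{
        at_least(v, 4, 2),
        at_least(v, 4, 15),
        at_least(v, 5, 6),
    };
}
```

For example, `features_for(KernelVersion{5, 4})` reports that bees can run with extent scan but without `openat2` support.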



@@ -62,14 +55,17 @@ These bugs are particularly popular among bees users, though not all are specifi
 | 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
 | - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
 | - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
+| 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount.  Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
 | 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
 | - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
 | - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
 | 5.18 | 5.19 | parent transid verify failed during log tree replay after a crash during a rename operation | 5.18.18, 5.19.2, 6.0 and later | 723df2bcc9e1 btrfs: join running log transaction when logging new name
 | 5.12 | 6.0 | space cache corruption and potential double allocations | 5.15.65, 5.19.6, 6.0 and later | ced8ecf026fd btrfs: fix space cache corruption and potential double allocations
+| 6.0 | 6.5 | suboptimal allocation in multi-device filesystems due to chunk allocator regression | 6.1.60, 6.5.9, 6.6 and later | 8a540e990d7d btrfs: fix stripe length calculation for non-zoned data chunk allocation
 | 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later.  Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
 | 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
-| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that
+| 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
+| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that

 "Last bad kernel" refers to that version's last stable update from
 kernel.org.  Distro kernels may backport additional fixes.  Consult
@@ -95,12 +91,12 @@ contains the last committed component of the fix.
 Workarounds for known kernel bugs
 ---------------------------------

-* **Hangs with concurrent `LOGICAL_INO` and dedupe**:  on all
-  kernel versions so far, multiple threads running `LOGICAL_INO`
-  and dedupe ioctls at the same time on the same inodes or extents
+* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**:  on all
+  kernel versions so far, multiple threads running `LOGICAL_INO` and
+  dedupe/clone ioctls at the same time on the same inodes or extents
  can lead to a kernel hang.  The kernel enters an infinite loop in
  `add_all_parents`, where `count` is 0, `ref->count` is 1, and
-  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref).
+  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.

  bees has two workarounds for this bug: 1. schedule work so that multiple
  threads do not simultaneously access the same inode or the same extent,
@@ -121,58 +117,32 @@ Workarounds for known kernel bugs

  It is still theoretically possible to trigger the kernel bug when
  running bees at the same time as other dedupers, or other programs
-  that use `LOGICAL_INO` like `btdu`; however, it's extremely difficult
-  to reproduce the bug without closely cooperating threads.
+  that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
+  operation such as `cp` or `mv`; however, it's extremely difficult to
+  reproduce the bug without closely cooperating threads.

-* **Slow backrefs** (aka toxic extents):  Under certain conditions,
-  if the number of references to a single shared extent grows too
-  high, the kernel consumes more and more CPU while also holding locks
-  that delay write access to the filesystem.  bees avoids this bug
-  by measuring the time the kernel spends performing `LOGICAL_INO`
-  operations and permanently blacklisting any extent or hash involved
-  where the kernel starts to get slow.  In the bees log, such blocks
-  are labelled as 'toxic' hash/block addresses.  Toxic extents are
-  rare (about 1 in 100,000 extents become toxic), but toxic extents can
-  become 8 orders of magnitude more expensive to process than the fastest
-  non-toxic extents.  This seems to affect all dedupe agents on btrfs;
-  at this time of writing only bees has a workaround for this bug.
+* **Slow backrefs** (aka toxic extents):  On older kernels, under certain
+  conditions, if the number of references to a single shared extent grows
+  too high, the kernel consumes more and more CPU while also holding
+  locks that delay write access to the filesystem.  This is no longer
+  a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
+  but there are still some remnants of earlier workarounds for this issue
+  in bees that have not been fully removed.

-  This workaround is less necessary for kernels 5.4.96, 5.7 and later,
-  though the bees workaround can still be triggered on newer kernels
-  by changes in btrfs since kernel version 5.1.
+  bees avoided this bug by measuring the time the kernel spends performing
+  `LOGICAL_INO` operations and permanently blacklisting any extent or
+  hash involved where the kernel starts to get slow.  In the bees log,
+  such blocks are labelled as 'toxic' hash/block addresses.
+
+  Future bees releases will remove toxic extent detection (it only detects
+  false positives now) and clear all previously saved toxic extent bits.

 * **dedupe breaks `btrfs send` in old kernels**.  The bees option
  `--workaround-btrfs-send` prevents any modification of read-only subvols
-  in order to avoid breaking `btrfs send`.
+  in order to avoid breaking `btrfs send` on kernels before 5.2.

-  This workaround is no longer necessary to avoid kernel crashes
-  and send performance failure on kernel 4.9.207, 4.14.159, 4.19.90,
-  5.3.17, 5.4.4, 5.5 and later; however, some conflict between send
-  and dedupe still remains, so the workaround is still useful.
+  This workaround is no longer necessary to avoid kernel crashes and
+  send performance failure on kernel 5.4.4 and later.  bees will pause
+  dedupe until the send is finished on current kernels.

  `btrfs receive` is not and has never been affected by this issue.
-
-Unfixed kernel bugs
-------------------
-
-* **The kernel does not permit `btrfs send` and dedupe to run at the
-  same time**.  Recent kernels no longer crash, but now refuse one
-  operation with an error if the other operation was already running.
-
-  bees has not been updated to handle the new dedupe behavior optimally.
-  Optimal behavior is to defer dedupe operations when send is detected,
-  and resume after the send is finished.  Current bees behavior is to
-  complain loudly about each individual dedupe failure in log messages,
-  and abandon duplicate data references in the snapshot that send is
-  processing.  A future bees version shall have better handling for
-  this situation.
-
-  Workaround:  send `SIGSTOP` to bees, or terminate the bees process,
-  before running `btrfs send`.
-
-  This workaround is not strictly required if snapshot is deleted after
-  sending.  In that case, any duplicate data blocks that were not removed
-  by dedupe will be removed by snapshot delete instead.  The workaround
-  still saves some IO.
-
-  `btrfs receive` is not affected by this issue.
@@ -3,40 +3,34 @@ Good Btrfs Feature Interactions

 bees has been tested in combination with the following:

-* btrfs compression (zlib, lzo, zstd), mixtures of compressed and uncompressed extents
+* btrfs compression (zlib, lzo, zstd)
 * PREALLOC extents (unconditionally replaced with holes)
 * HOLE extents and btrfs no-holes feature
-* Other deduplicators, reflink copies (though bees may decide to redo their work)
-* btrfs snapshots and non-snapshot subvols (RW and RO)
+* Other deduplicators (`duperemove`, `jdupes`)
+* Reflink copies (modern coreutils `cp` and `mv`)
 * Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
-* All btrfs RAID profiles
-* IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
-* Filesystems mounted with or without the `flushoncommit` option
+* All btrfs RAID profiles:  single, dup, raid0, raid1, raid10, raid1c3, raid1c4, raid5, raid6
+* IO errors during dedupe (affected extents are skipped)
 * 4K filesystem data block size / clone alignment
 * 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
 * Large files (kernel 5.4 or later strongly recommended)
-* Filesystems up to 90T+ bytes, 1000M+ files
+* Filesystem data sizes up to 100T+ bytes, 1000M+ files
+* `open(O_DIRECT)` (seems to work as well--or as poorly--with bees as with any other btrfs feature)
+* btrfs-convert from ext2/3/4
+* btrfs `autodefrag` mount option
+* btrfs balance (data balances cause rescan of relocated data)
+* btrfs block-group-tree
+* btrfs `flushoncommit` and `noflushoncommit` mount options
+* btrfs mixed block groups
+* btrfs `nodatacow`/`nodatasum` inode attribute or mount option (bees skips all nodatasum files)
+* btrfs qgroups and quota support (_not_ squotas)
 * btrfs receive
-* btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
-* open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)
-* lvm dm-cache, writecache
+* btrfs scrub
+* btrfs send (dedupe pauses automatically, kernel 5.4 or later required)
+* btrfs snapshot, non-snapshot subvols (RW and RO), snapshot delete

-Bad Btrfs Feature Interactions
------------------------------
-
-bees has been tested in combination with the following, and various problems are known:
-
-* btrfs send:  there are bugs in `btrfs send` that can be triggered by
-  bees on old kernels.  The [`--workaround-btrfs-send` option](options.md)
-  works around this issue by preventing bees from modifying read-only
-  snapshots.
-
-* btrfs qgroups:  very slow, sometimes hangs...and it's even worse when
-  bees is running.
-
-* btrfs autodefrag mount option:  bees cannot distinguish autodefrag
-  activity from normal filesystem activity, and may try to undo the
-  autodefrag if duplicate copies of the defragmented data exist.
+**Note:** some btrfs features have minimum kernel versions which are
+higher than the minimum kernel version for bees.

 Untested Btrfs Feature Interactions
 -----------------------------------
@@ -45,10 +39,6 @@ bees has not been tested with the following, and undesirable interactions may oc

 * Non-4K filesystem data block size (should work if recompiled)
 * Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
-* btrfs seed filesystems (no particular reason it wouldn't work, but no one has reported trying)
-* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe, encryption, extent tree v2)
-* btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
-* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
+* btrfs seed filesystems, raid-stripe-tree, squotas (no particular reason these wouldn't work, but no one has reported trying)
+* btrfs out-of-tree kernel patches (e.g. encryption, extent tree v2)
 * Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
-* bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
-* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper
@@ -26,11 +26,7 @@ Here are some numbers to estimate appropriate hash table sizes:
 Notes:

 * If the hash table is too large, no extra dedupe efficiency is
-obtained, and the extra space wastes RAM.  If the hash table contains
-more block records than there are blocks in the filesystem, the extra
-space can slow bees down.  A table that is too large prevents obsolete
-data from being evicted, so bees wastes time looking for matching data
-that is no longer present on the filesystem.
+obtained, and the extra space wastes RAM.

 * If the hash table is too small, bees extrapolates from matching
 blocks to find matching adjacent blocks in the filesystem that have been
@@ -59,19 +55,19 @@ patterns on dedupe effectiveness without performing deep inspection of
 both the filesystem data and its structure--a task that is as expensive
 as performing the deduplication.

-* **Compression** on the filesystem reduces the average extent length
-compared to uncompressed filesystems.  The maximum compressed extent
-length on btrfs is 128KB, while the maximum uncompressed extent length
-is 128MB.  Longer extents decrease the optimum hash table size while
-shorter extents increase the optimum hash table size because the
-probability of a hash table entry being present (i.e. unevicted) in
-each extent is proportional to the extent length.
+* **Compression** in files reduces the average extent length compared
+to uncompressed files.  The maximum compressed extent length on
+btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
+Longer extents decrease the optimum hash table size while shorter extents
+increase the optimum hash table size, because the probability of a hash
+table entry being present (i.e. unevicted) in each extent is proportional
+to the extent length.

   As a rule of thumb, the optimal hash table size for a compressed
 filesystem is 2-4x larger than the optimal hash table size for the same
-data on an uncompressed filesystem.  Dedupe efficiency falls dramatically
-with hash tables smaller than 128MB/TB as the average dedupe extent size
-is larger than the largest possible compressed extent size (128KB).
+data on an uncompressed filesystem.  Dedupe efficiency falls rapidly with
+hash tables smaller than 128MB/TB as the average dedupe extent size is
+larger than the largest possible compressed extent size (128KB).

 * **Short writes or fragmentation** also shorten the average extent
 length and increase optimum hash table size.  If a database writes to
@@ -98,27 +94,70 @@ code files over and over, so it will need a smaller hash table than a
 backup server which has to refer to the oldest data on the filesystem
 every time a new client machine's data is added to the server.

-Scanning modes for multiple subvols
-----------------------------------
+Scanning modes
+--------------

-The `--scan-mode` option affects how bees schedules worker threads
-between subvolumes.  Scan modes are an experimental feature and will
-likely be deprecated in favor of a better solution.
+The `--scan-mode` option affects how bees iterates over the filesystem,
+schedules extents for scanning, and tracks progress.

-Scan mode can be changed at any time by restarting bees with a different
-mode option.  Scan state tracking is the same for all of the currently
-implemented modes.  The difference between the modes is the order in
-which subvols are selected.
+There are now two kinds of scan mode:  the legacy **subvol** scan modes,
+and the new **extent** scan mode.

-If a filesystem has only one subvolume with data in it, then the
-`--scan-mode` option has no effect.  In this case, there is only one
-subvolume to scan, so worker threads will all scan that one.
+Scan mode can be changed by restarting bees with a different scan mode
+option.

-Within a subvol, there is a single optimal scan order:  files are scanned
-in ascending numerical inode order.  Each worker will scan a different
-inode to avoid having the threads contend with each other for locks.
-File data is read sequentially and in order, but old blocks from earlier
-scans are skipped.
+Extent scan mode:
+
+ * Works with 4.15 and later kernels.
+ * Can estimate progress and provide an ETA.
+ * Can optimize scanning order to dedupe large extents first.
+ * Can keep up with frequent creation and deletion of snapshots.
+
+Subvol scan modes:
+
+ * Work with 4.14 and earlier kernels.
+ * Cannot estimate or report progress.
+ * Cannot optimize scanning order by extent size.
+ * Have problems keeping up with multiple snapshots created during a scan.
+
+The default scan mode is 4, "extent".
+
+If you are using bees for the first time on a filesystem with many
+existing snapshots, you should read about [snapshot gotchas](gotchas.md).
+
+Subvol scan modes
+-----------------
+
+Subvol scan modes are maintained for compatibility with existing
+installations, but will not be developed further.  New installations
+should use extent scan mode instead.
+
+The _quantity_ of text below detailing the shortcomings of each subvol
+scan mode should be informative all by itself.
+
+Subvol scan modes work on any kernel version supported by bees.  They
+are the only scan modes usable on kernel 4.14 and earlier.
+
+The difference between the subvol scan modes is the order in which the
+files from different subvols are fed into the scanner.  They all scan
+files in inode number order, from low to high offset within each inode,
+the same way that a program like `cat` would read files (but skipping
+over old data from earlier btrfs transactions).
+
+If a filesystem has only one subvolume with data in it, then all of
+the subvol scan modes are equivalent.  In this case, there is only one
+subvolume to scan, so every possible ordering of subvols is the same.
+
+The `--workaround-btrfs-send` option pauses scanning subvols that are
+read-only.  If the subvol is made read-write (e.g. with `btrfs prop set
+$subvol ro false`), or if the `--workaround-btrfs-send` option is removed,
+then the scan of that subvol is unpaused and dedupe proceeds normally.
+Space will only be recovered when the last read-only subvol is deleted.
+
+Subvol scan modes cannot efficiently or accurately calculate an ETA for
+completion or estimate progress through the data.  They simply request
+"the next new inode" from btrfs, and they are completed when btrfs says
+there is no next new inode.

 Between subvols, there are several scheduling algorithms with different
 trade-offs:
@@ -126,68 +165,151 @@ trade-offs:
 Scan mode 0, "lockstep", scans the same inode number in each subvol at
 close to the same time.  This is useful if the subvols are snapshots
 with a common ancestor, since the same inode number in each subvol will
-have similar or identical contents.  This maximizes the likelihood
-that all of the references to a snapshot of a file are scanned at
-close to the same time, improving dedupe hit rate and possibly taking
-advantage of VFS caching in the Linux kernel.  If the subvols are
-unrelated (i.e. not snapshots of a single subvol) then this mode does
-not provide significant benefit over random selection.  This mode uses
-smaller amounts of temporary space for shorter periods of time when most
-subvols are snapshots.  When a new snapshot is created, this mode will
-stop scanning other subvols and scan the new snapshot until the same
-inode number is reached in each subvol, which will effectively stop
-dedupe temporarily as this data has already been scanned and deduped
-in the other snapshots.
+have similar or identical contents.  This maximizes the likelihood that
+all of the references to a snapshot of a file are scanned at close to
+the same time, improving dedupe hit rate.  If the subvols are unrelated
+(i.e. not snapshots of a single subvol) then this mode does not provide
+any significant advantage.  This mode uses smaller amounts of temporary
+space for shorter periods of time when most subvols are snapshots.  When a
+new snapshot is created, this mode will stop scanning other subvols and
+scan the new snapshot until the same inode number is reached in each
+subvol, which will effectively stop dedupe temporarily as this data has
+already been scanned and deduped in the other snapshots.

-Scan mode 1, "independent", scans the next inode with new data in each
-subvol.  Each subvol's scanner shares inodes uniformly with all other
-subvol scanners until the subvol has no new inodes left.  This mode makes
-continuous forward progress across the filesystem and provides average
-performance across a variety of workloads, but is slow to respond to new
-data, and may spend a lot of time deduping short-lived subvols that will
-soon be deleted when it is preferable to dedupe long-lived subvols that
-will be the origin of future snapshots.  When a new snapshot is created,
-previous subvol scans continue as before, but the time is now divided
-among one more subvol.
+Scan mode 1, "independent", scans the next inode with new data in
+each subvol.  There is no coordination between the subvols, other than
+round-robin distribution of files from each subvol to each worker thread.
+This mode makes continuous forward progress in all subvols.  When a new
+snapshot is created, previous subvol scans continue as before, but the
+worker threads are now divided among one more subvol.

 Scan mode 2, "sequential", scans one subvol at a time, in numerical subvol
-ID order, processing each subvol completely before proceeding to the
-next subvol.  This avoids spending time scanning short-lived snapshots
-that will be deleted before they can be fully deduped (e.g. those used
-for `btrfs send`).  Scanning is concentrated on older subvols that are
-more likely to be origin subvols for future snapshots, eliminating the
-need to dedupe future snapshots separately.  This mode uses the largest
-amount of temporary space for the longest time, and typically requires
-a larger hash table to maintain dedupe hit rate.
+ID order, processing each subvol completely before proceeding to the next
+subvol.  This avoids spending time scanning short-lived snapshots that
+will be deleted before they can be fully deduped (e.g. those used for
+`btrfs send`).  Scanning starts on older subvols that are more likely
+to be origin subvols for future snapshots, eliminating the need to
+dedupe future snapshots separately.  This mode uses the largest amount
+of temporary space for the longest time, and typically requires a larger
+hash table to maintain dedupe hit rate.

 Scan mode 3, "recent", scans the subvols with the highest `min_transid`
 value first (i.e. the ones that were most recently completely scanned),
 then falls back to "independent" mode to break ties.  This interrupts
-long scans of old subvols to give a rapid dedupe response to new data,
-then returns to the old subvols after the new data is scanned.  It is
-useful for large filesystems with multiple active subvols and rotating
-snapshots, where the first-pass scan can take months, but new duplicate
-data appears every day.
+long scans of old subvols to give a rapid dedupe response to new data
+in previously scanned subvols, then returns to the old subvols after
+the new data is scanned.

-The default scan mode is 1, "independent".
+Extent scan mode
+----------------

-If you are using bees for the first time on a filesystem with many
-existing snapshots, you should read about [snapshot gotchas](gotchas.md).
+Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
+Extent scan mode reads each extent once, regardless of the number of
+reflinks or snapshots.  It adapts to the creation of new snapshots
+and reflinks immediately, without having to revisit old data.
+
+In the extent scan mode, extents are separated into multiple size tiers
+to prioritize large extents over small ones.  Deduping large extents
+keeps the metadata update cost low per block saved, resulting in faster
+dedupe at the start of a scan cycle.  This is important for maximizing
+performance in use cases where bees runs for a limited time, such as
+during an overnight maintenance window.
+
+Once the larger size tiers are completed, space recovery from dedupe
+slows down significantly.  It may be desirable to stop bees once the
+larger size tiers are finished, then restart it some time later after
+new data has appeared.
+
+Each extent is mapped in physical address order, and all extent references
+are submitted to the scanner at the same time, resulting in much better
+cache behavior and dedupe performance compared to the subvol scan modes.
+
+The "extent" scan mode is not usable on kernels before 4.15 because
+it relies on the `LOGICAL_INO_V2` ioctl added in that kernel release.
+When using bees with an older kernel, only subvol scan modes will work.
+
+Extents are divided into virtual subvols by size, using reserved btrfs
+subvol IDs 250..255.  The size tier groups are:
+ * 250: 32M+1 and larger
+ * 251: 8M+1..32M
+ * 252: 2M+1..8M
+ * 253: 512K+1..2M
+ * 254: 128K+1..512K
+ * 255: 128K and smaller (includes all compressed extents)
+
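As a rough illustration of the tier boundaries listed above, the following C++ sketch maps an extent length in bytes to a virtual subvol ID. The name `size_tier_subvol` and the constants are hypothetical and are not taken from the bees source. Which length the real crawler measures, and how it classifies compressed extents, is assumed here rather than confirmed.

```cpp
#include <cstdint>

// Hypothetical constants for this sketch only.
constexpr uint64_t KiB = 1024;
constexpr uint64_t MiB = 1024 * KiB;

// Hypothetical helper: returns the virtual subvol ID (250..255) of the
// size tier that an extent of the given length would belong to, using
// the boundaries from the list above.  Each tier's upper bound is
// inclusive, e.g. exactly 32M belongs to tier 251, not 250.
constexpr uint64_t size_tier_subvol(uint64_t extent_bytes)
{
	return extent_bytes >  32 * MiB ? 250
	     : extent_bytes >   8 * MiB ? 251
	     : extent_bytes >   2 * MiB ? 252
	     : extent_bytes > 512 * KiB ? 253
	     : extent_bytes > 128 * KiB ? 254
	     :                            255;
}

// Compile-time spot checks of the boundaries.
static_assert(size_tier_subvol(24 * KiB) == 255, "small extents are tier 255");
static_assert(size_tier_subvol(32 * MiB) == 251, "32M is the top of tier 251");
static_assert(size_tier_subvol(32 * MiB + 1) == 250, "32M+1 starts tier 250");
```

Under these assumptions, compressed extents need no special case: the list above places all of them in tier 255, which is consistent with btrfs compressed extents not exceeding 128K.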
+Extent scan mode can efficiently calculate dedupe progress within
+the filesystem and estimate an ETA for completion within each size
+tier; however, the accuracy of the ETA can be questionable due to the
+non-uniform distribution of block addresses in a typical user filesystem.
+
+Older versions of bees do not recognize the virtual subvols, so running
+an old bees version after running a new bees version will reset the
+"extent" scan mode's progress in `beescrawl.dat` to the beginning.
+This may change in future bees releases, i.e. extent scans will store
+their checkpoint data somewhere else.
+
+The `--workaround-btrfs-send` option behaves differently in extent
+scan mode:  dedupe proceeds on all subvols that are read-write, but
+all subvols that are read-only are excluded from dedupe.
+Space will only be recovered when the last read-only subvol is deleted.
+
+During `btrfs send`, no duplicate extents in the sent subvol will be
+removed (the kernel will reject dedupe commands while send is active,
+and bees currently will not re-issue them after the send is complete).
+It may be preferable to terminate the bees process while running `btrfs
+send` in extent scan mode, and restart bees after the `send` is complete.

 Threads and load management
 ---------------------------

-By default, bees creates one worker thread for each CPU detected.
-These threads then perform scanning and dedupe operations.  The number of
-worker threads can be set with the [`--thread-count` and `--thread-factor`
-options](options.md).
+By default, bees creates one worker thread for each CPU detected.  These
+threads then perform scanning and dedupe operations.  bees attempts to
+maximize the amount of productive work each thread does, until either the
+threads are all continuously busy, or there is no remaining work to do.

-If desired, bees can automatically increase or decrease the number
-of worker threads in response to system load.  This reduces impact on
-the rest of the system by pausing bees when other CPU and IO intensive
-loads are active on the system, and resumes bees when the other loads
-are inactive.  This is configured with the [`--loadavg-target` and
-`--thread-min` options](options.md).
+In many cases it is not desirable to continually run bees at maximum
+performance.  Maximum performance is not necessary if bees can dedupe
+new data faster than it appears on the filesystem.  If it only takes
+bees 10 minutes per day to dedupe all new data on a filesystem, then
+bees doesn't need to run for more than 10 minutes per day.
+
+bees supports a number of options for reducing system load:
+
+ * Run bees for a few hours per day, at an off-peak time (i.e. during
+ a maintenance window), instead of running bees continuously.  Any data
+ added to the filesystem while bees is not running will be scanned when
+ bees restarts.  At the end of the maintenance window, terminate the
+ bees process with SIGTERM to write the hash table and scan position
+ for the next maintenance window.
+
+ * Temporarily pause bees operation by sending the bees process SIGUSR1,
+ and resume operation with SIGUSR2.  This is preferable to freezing
+ and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
+ signals, because it allows bees to close open file handles that would
+ otherwise prevent those files from being deleted while bees is frozen.
+
+ * Reduce the number of worker threads with the [`--thread-count` or
+`--thread-factor` options](options.md).  This simply leaves CPU cores
+ idle so that other applications on the host can use them, or to save
+ power.
+
+ * Allow bees to automatically track system load and increase or decrease
+ the number of threads to reach a target system load.  This reduces
+ impact on the rest of the system by pausing bees when other CPU and IO
+ intensive loads are active on the system, and resumes bees when the other
+ loads are inactive.  This is configured with the [`--loadavg-target`
+ and `--thread-min` options](options.md).
+
+ * Allow bees to self-throttle operations that enqueue delayed work
+ within btrfs.  These operations are not well controlled by Linux
+ features such as process priority or IO priority or IO rate-limiting,
+ because the enqueued work is submitted to btrfs several seconds before
+ btrfs performs the work.  By the time btrfs performs the work, it's too
+ late for external throttling to be effective.  The [`--throttle-factor`
+ option](options.md) tracks how long it takes btrfs to complete queued
+ operations, and reduces bees's queued work submission rate to match
+ btrfs's queued work completion rate (or a fraction thereof, to reduce
+ system load).

 Log verbosity
 -------------
@@ -120,10 +120,14 @@ The `crawl` event group consists of operations related to scanning btrfs trees t

 * `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
 * `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
- * `crawl_create`: A new subvol crawler was created.
- * `crawl_done`: One pass over all subvols on the filesystem was completed.
+ * `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
+ * `crawl_done`: One pass over a subvol was completed.
+ * `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
+ * `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
 * `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
+ * `crawl_extent`: The extent crawler queued all references to an extent for processing.
 * `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
+ * `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
 * `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
 * `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
 * `crawl_hole`: An extent item in the search results refers to a hole.
@@ -135,8 +139,13 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
 * `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
 * `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
 * `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
+ * `crawl_skip`: Small extent items were skipped because no extent of sufficient size was found within the minimum search distance.
+ * `crawl_skip_ms`: Time spent skipping small extent items.
 * `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
+ * `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
+ * `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
 * `crawl_unknown`: An extent item in the search results has an unrecognized type.
+ * `crawl_unthrottled`: Extent scan was allowed to create work queue items again.

 dedup
 -----
@@ -162,6 +171,25 @@ The `exception` event group consists of C++ exceptions.  C++ exceptions are thro
 * `exception_caught`: Total number of C++ exceptions thrown and caught by a generic exception handler.
 * `exception_caught_silent`: Total number of "silent" C++ exceptions thrown and caught by a generic exception handler.  These are exceptions which are part of the correct and normal operation of bees.  The exceptions are logged at a lower log level.

+extent
+------
+
+The `extent` event group consists of events that occur within the extent scanner.
+
+ * `extent_deferred_inode`: A lock conflict was detected when two worker threads attempted to manipulate the same inode at the same time.
+ * `extent_empty`: A complete list of references to an extent was created but the list was empty, e.g. because all refs are in deleted inodes or snapshots.
+ * `extent_fail`: An ioctl call to `LOGICAL_INO` failed.
+ * `extent_forward`: An extent reference was submitted for scanning.
+ * `extent_mapped`: A complete map of references to an extent was created and added to the crawl queue.
+ * `extent_ok`: An ioctl call to `LOGICAL_INO` completed successfully.
+ * `extent_overflow`: A complete map of references to an extent exceeded `BEES_MAX_EXTENT_REF_COUNT`, so the extent was dropped.
+ * `extent_ref_missing`: An extent reference reported by `LOGICAL_INO` was not found by later `TREE_SEARCH_V2` calls.
+ * `extent_ref_ok`: One extent reference was queued for scanning.
+ * `extent_restart`: An extent reference was requeued to be scanned again after an active extent lock is released.
+ * `extent_retry`: An extent reference was requeued to be scanned again after an active inode lock is released.
+ * `extent_skip`: A 4K extent with more than 1000 refs was skipped.
+ * `extent_zero`: An ioctl call to `LOGICAL_INO` succeeded, but reported an empty list of extents.
+
 hash
 ----

@@ -180,24 +208,6 @@ The `hash` event group consists of operations related to the bees hash table.
 * `hash_insert`: A `(hash, address)` pair was inserted by `BeesHashTable::push_random_hash_addr`.
 * `hash_lookup`: The hash table was searched for `(hash, address)` pairs matching a given `hash`.

-inserted
--------
-
-The `inserted` event group consists of operations related to storing hash and address data in the hash table (i.e. the hash table client).
-
- * `inserted_block`: Total number of data block references scanned and inserted into the hash table.
- * `inserted_clobbered`: Total number of data block references scanned and eliminated from the filesystem.
-
-matched
-------
-
-The `matched` event group consists of events related to matching incoming data blocks against existing hash table entries.
-
- * `matched_0`: A data block was scanned, hash table entries found, but no matching data blocks on the filesytem located.
- * `matched_1_or_more`: A data block was scanned, hash table entries found, and one or more matching data blocks on the filesystem located.
- * `matched_2_or_more`: A data block was scanned, hash table entries found, and two or more matching data blocks on the filesystem located.
- * `matched_3_or_more`: A data block was scanned, hash table entries found, and three or more matching data blocks on the filesystem located.
-
 open
 ----

@@ -259,12 +269,29 @@ The `pairforward` event group consists of events related to extending matching b
 * `pairforward_try`: Started extending a pair of matching block ranges forward.
 * `pairforward_zero`: A pair of matching block ranges could not be extended backward by one block because the src block contained all zeros and was not compressed.

+progress
+--------
+
+The `progress` event group consists of events related to progress estimation.
+
+ * `progress_no_data_bg`: Failed to retrieve any data block groups from the filesystem.
+ * `progress_not_created`: A crawler for one size tier had not been created for the extent scanner.
+ * `progress_complete`: A crawler for one size tier has completed a scan.
+ * `progress_not_found`: The extent position for a crawler does not correspond to any block group.
+ * `progress_out_of_bg`: The extent position for a crawler does not correspond to any data block group.
+ * `progress_ok`: Table of progress and ETA created successfully.
+
 readahead
 ---------

-The `readahead` event group consists of events related to calls to `posix_fadvise`.
+The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).

- * `readahead_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_WILLNEED)` aka `readahead()`.
+ * `readahead_bytes`: Number of bytes prefetched.
+ * `readahead_count`: Number of read calls.
+ * `readahead_clear`: Number of times the duplicate read cache was cleared.
+ * `readahead_fail`: Number of read errors during prefetch.
+ * `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
+ * `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
 * `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.

 replacedst
@@ -301,7 +328,7 @@ The `resolve` event group consists of operations related to translating a btrfs
 * `resolve_large`: The `LOGICAL_INO` ioctl returned more than 2730 results (the limit of the v1 ioctl).
 * `resolve_ms`: Total time spent in the `LOGICAL_INO` ioctl (i.e. wallclock time, not kernel CPU time).
 * `resolve_ok`: The `LOGICAL_INO` ioctl returned success.
- * `resolve_overflow`: The `LOGICAL_INO` ioctl returned more than 655050 extents (the limit of the v2 ioctl).
+ * `resolve_overflow`: The `LOGICAL_INO` ioctl returned 9999 or more extents (the limit configured in `bees.h`).
 * `resolve_toxic`: The `LOGICAL_INO` ioctl took more than 0.1 seconds of kernel CPU time.

 root
@@ -329,35 +356,38 @@ The `scan` event group consists of operations related to scanning incoming data.

 * `scan_blacklisted`: A blacklisted extent was passed to `scan_forward` and dropped.
 * `scan_block`: A block of data was scanned.
- * `scan_bump`: After deduping a block range, the scan pointer had to be moved past the end of the deduped byte range.
- * `scan_dup_block`: Number of duplicate blocks deduped.
- * `scan_dup_hit`: A pair of duplicate block ranges was found and removed.
+ * `scan_compressed_no_dedup`: An extent that was compressed contained non-zero, non-duplicate data.
+ * `scan_dup_block`: Number of duplicate block references deduped.
+ * `scan_dup_hit`: A pair of duplicate block ranges was found.
 * `scan_dup_miss`: A pair of duplicate blocks was found in the hash table but not in the filesystem.
- * `scan_eof`: Scan past EOF was attempted.
- * `scan_erase_redundant`: Blocks in the hash table were removed because they were removed from the filesystem by dedupe.
 * `scan_extent`: An extent was scanned (`scan_one_extent`).
- * `scan_extent_tiny`: An extent below 128K that was not the beginning or end of a file was scanned.  No action is currently taken for these--they are merely counted.
 * `scan_forward`: A logical byte range was scanned (`scan_forward`).
 * `scan_found`: An entry was found in the hash table matching a scanned block from the filesystem.
 * `scan_hash_hit`: A block was found on the filesystem corresponding to a block found in the hash table.
 * `scan_hash_miss`: A block was not found on the filesystem corresponding to a block found in the hash table.
- * `scan_hash_preinsert`: A block was prepared for insertion into the hash table.
+ * `scan_hash_preinsert`: A non-zero data block's hash was prepared for possible insertion into the hash table.
+ * `scan_hash_insert`: A non-zero data block's hash was inserted into the hash table.
 * `scan_hole`: A hole extent was found during scan and ignored.
 * `scan_interesting`: An extent had flags that were not recognized by bees and was ignored.
 * `scan_lookup`: A hash was looked up in the hash table.
 * `scan_malign`: A block being scanned matched a hash at EOF in the hash table, but the EOF was not aligned to a block boundary and the two blocks did not have the same length.
- * `scan_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
- * `scan_no_rewrite`: All blocks in an extent were removed by dedupe (i.e. no copies).
 * `scan_push_front`: An entry in the hash table matched a duplicate block, so the entry was moved to the head of its LRU list.
 * `scan_reinsert`: A copied block's hash and block address was inserted into the hash table.
 * `scan_resolve_hit`: A block address in the hash table was successfully resolved to an open FD and offset pair.
 * `scan_resolve_zero`: A block address in the hash table was not resolved to any subvol/inode pair, so the corresponding hash table entry was removed.
 * `scan_rewrite`: A range of bytes in a file was copied, then the copy deduped over the original data.
+ * `scan_root_dead`: A deleted subvol was detected.
+ * `scan_seen_clear`: The list of recently scanned extents reached maximum size and was cleared.
+ * `scan_seen_erase`: An extent reference was modified by scan, so all future references to the extent must be scanned.
+ * `scan_seen_hit`: A scan was skipped because the same extent had recently been scanned.
+ * `scan_seen_insert`: An extent reference was not modified by scan and its hashes have been inserted into the hash table, so all future references to the extent can be ignored.
+ * `scan_seen_miss`: A scan was not skipped because the same extent had not recently been scanned (i.e. the extent was scanned normally).
+ * `scan_skip_bytes`: Nuisance dedupe or hole-punching would save less than half of the data in an extent.
+ * `scan_skip_ops`: Nuisance dedupe or hole-punching would require too many dedupe/copy/hole-punch operations in an extent.
 * `scan_toxic_hash`: A scanned block has the same hash as a hash table entry that is marked toxic.
 * `scan_toxic_match`: A hash table entry points to a block that is discovered to be toxic.
 * `scan_twice`: Two references to the same block have been found in the hash table.
- * `scan_zero_compressed`: An extent that was compressed and contained only zero bytes was found.
- * `scan_zero_uncompressed`: A block that contained only zero bytes was found in an uncompressed extent.
+ * `scan_zero`: A data block containing only zero bytes was detected.

 scanf
 -----
@@ -365,9 +395,10 @@ scanf
 The `scanf` event group consists of operations related to `BeesContext::scan_forward`.  This is the entry point where `crawl` schedules new data for scanning.

 * `scanf_deferred_extent`: Two tasks attempted to scan the same extent at the same time, so one was deferred.
- * `scanf_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
+ * `scanf_eof`: Scan past EOF was attempted.
 * `scanf_extent`: A btrfs extent item was scanned.
 * `scanf_extent_ms`: Total thread-seconds spent scanning btrfs extent items.
+ * `scanf_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
 * `scanf_total`: A logical byte range of a file was scanned.
 * `scanf_total_ms`: Total thread-seconds spent scanning logical byte ranges.

@@ -205,7 +205,7 @@ Other Gotchas

 * bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
  measuring the time required to perform `LOGICAL_INO` operations.
-  If an extent requires over 0.1 kernel CPU seconds to perform a
+  If an extent requires over 5.0 kernel CPU seconds to perform a
  `LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
  referencing it in future operations.  In most cases, fewer than 0.1%
  of extents in a filesystem must be avoided this way.  This results
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------

-bees is a block-oriented userspace deduplication agent designed for large
-btrfs filesystems.  It is an offline dedupe combined with an incremental
-data scan capability to minimize time data spends on disk from write
-to dedupe.
+bees is a block-oriented userspace deduplication agent designed to scale
+up to large btrfs filesystems.  It is an offline dedupe combined with
+an incremental data scan capability to minimize time data spends on disk
+from write to dedupe.

 Strengths
 ---------

- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Daemon mode - incrementally dedupes new data as it appears
+ * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
- * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
+ * btrfs support - recovers more free space from btrfs than naive dedupers

 Weaknesses
 ----------

 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
- * First run may require temporary disk space for extent reorganization
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * [First run may increase metadata space usage if many snapshots exist](gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------

 * [bees Gotchas](gotchas.md)
- * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](btrfs-other.md)
 * [What to do when something goes wrong](wrong.md)

@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
@@ -15,16 +15,9 @@ specific files (patches welcome).
 * PREALLOC extents and extents containing blocks filled with zeros will
 be replaced by holes.  There is no way to turn this off.

-* Consecutive runs of duplicate blocks that are less than 12K in length
-can take 30% of the processing time while saving only 3% of the disk
-space.  There should be an option to just not bother with those, but it's
-complicated by the btrfs requirement to always dedupe complete extents.
-
-* There is a lot of duplicate reading of blocks in snapshots.  bees will
-scan all snapshots at close to the same time to try to get better
-performance by caching, but really fixing this requires rewriting the
-crawler to scan the btrfs extent tree directly instead of the subvol
-FS trees.
+* The fundamental unit of deduplication is the extent _reference_, when
+it should be the _extent_ itself.  This is an architectural limitation
+that results in excess reads of extent data, even in the Extent scan mode.

 * Block reads are currently more allocation- and CPU-intensive than they
 should be, especially for filesystems on SSD where the IO overhead is
@@ -33,8 +26,9 @@ much smaller.  This is a problem for CPU-power-constrained environments

 * bees can currently fragment extents when required to remove duplicate
 blocks, but has no defragmentation capability yet.  When possible, bees
-will attempt to work with existing extent boundaries, but it will not
-aggregate blocks together from multiple extents to create larger ones.
+will attempt to work with existing extent boundaries and choose the
+largest fragments available, but it will not aggregate blocks together
+from multiple extents to create larger ones.

 * When bees fragments an extent, the copied data is compressed.  There
 is currently no way (other than by modifying the source) to select a
@@ -36,6 +36,34 @@

 Has no effect unless `--loadavg-target` is used to specify a target load.

+* `--throttle-factor FACTOR`
+
+ In order to avoid saturating btrfs deferred work queues, bees tracks
+ the time that operations with delayed effect (dedupe and tmpfile copy)
+ and operations with long run times (`LOGICAL_INO`) run.  If an operation
+ finishes before the average run time for that operation, bees will
+ sleep for the remainder of the average run time, so that operations
+ are submitted to btrfs at a rate similar to the rate that btrfs can
+ complete them.
+
+ The `FACTOR` is multiplied by the average run time for each operation
+ to calculate the target delay time.
+
+ `FACTOR` 0 is the default, which adds no delays.  bees will attempt
+ to saturate btrfs delayed work queues as quickly as possible, which
+ may impact other processes on the same filesystem, or even slow down
+ bees itself.
+
+ `FACTOR` 1.0 will attempt to keep btrfs delayed work queues filled at
+ a steady average rate.
+
+ `FACTOR` more than 1.0 will add delays longer than the average
+ run time (e.g. 10.0 will delay all operations that take less than 10x
+ the average run time).  High values of `FACTOR` may be desirable when
+ using bees with other applications on the same filesystem.
+
+ The maximum delay per operation is 60 seconds.
+
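The delay rule described for `--throttle-factor` can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the bees implementation: the names `throttle_sleep_seconds`, `clamp_delay`, and `MAX_THROTTLE_DELAY_SEC` are hypothetical, times are plain seconds, and the 60-second limit is assumed to apply to the sleep time rather than to the target time.

```cpp
// Hypothetical constant: the maximum delay per operation stated above.
constexpr double MAX_THROTTLE_DELAY_SEC = 60.0;

// Clamp a candidate sleep time into the range [0, 60] seconds.
constexpr double clamp_delay(double seconds)
{
	return seconds < 0.0 ? 0.0
	     : seconds > MAX_THROTTLE_DELAY_SEC ? MAX_THROTTLE_DELAY_SEC
	     : seconds;
}

// Hypothetical helper: how long to sleep after one operation.
//   factor      - the FACTOR argument of --throttle-factor
//   average_sec - average run time observed for this kind of operation
//   elapsed_sec - run time of the operation that just finished
// The target time is factor * average_sec; the sleep is whatever part
// of the target the operation did not already use.
constexpr double throttle_sleep_seconds(double factor, double average_sec, double elapsed_sec)
{
	return clamp_delay(factor * average_sec - elapsed_sec);
}

// FACTOR 0 (the default) never sleeps.
static_assert(throttle_sleep_seconds(0.0, 2.0, 0.5) == 0.0, "no delay by default");
// FACTOR 1.0 sleeps for the remainder of the average run time.
static_assert(throttle_sleep_seconds(1.0, 2.0, 0.5) == 1.5, "remainder of average");
```

With these assumptions, an operation that takes longer than the target time adds no delay, and a large `FACTOR` combined with a slow average saturates at the 60-second limit.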
 ## Filesystem tree traversal options

 * `--scan-mode MODE` or `-m`
@@ -47,6 +75,7 @@
  * Mode 1: independent
  * Mode 2: sequential
  * Mode 3: recent
+  * Mode 4: extent

 For details of the different scanning modes and the default value of
 this option, see [bees configuration](config.md).
@@ -55,19 +84,22 @@

 * `--workaround-btrfs-send` or `-a`

+ _This option is obsolete and should not be used any more._
+
 Pretend that read-only snapshots are empty and silently discard any
-request to dedupe files referenced through them.  This is a workaround for
-[problems with the kernel implementation of `btrfs send` and `btrfs send
+request to dedupe files referenced through them.  This is a workaround
+for [problems with old kernels running `btrfs send` and `btrfs send
 -p`](btrfs-kernel.md) which make these btrfs features unusable with bees.

- This option should be used to avoid breaking `btrfs send` on the same
-filesystem.
+ This option was used to avoid breaking `btrfs send` on old kernels.
+ The affected kernels are now too old to be recommended for use with bees.
+
+ bees now waits for `btrfs send` to finish.  There is no need for an
+ option to enable this.

 **Note:** There is a _significant_ space tradeoff when using this option:
 it is likely no space will be recovered--and possibly significant extra
-space used--until the read-only snapshots are deleted.  On the other
-hand, if snapshots are rotated frequently then bees will spend less time
-scanning them.
+space used--until the read-only snapshots are deleted.

 ## Logging options

@@ -75,9 +75,8 @@ in the shell script that launches `bees`:
        schedtool -D -n20 $$
        ionice -c3 -p $$

-You can also use the [`--loadavg-target` and `--thread-min`
-options](options.md) to further control the impact of bees on the rest
-of the system.
+You can also use the [load management options](options.md) to further
+control the impact of bees on the rest of the system.

 Let the bees fly:

@@ -4,16 +4,13 @@ What to do when something goes wrong with bees
 Hangs and excessive slowness
 ----------------------------

-### Are you using qgroups or autodefrag?
-
-  Read about [bad btrfs feature interactions](btrfs-other.md).
-
 ### Use load-throttling options

  If bees is just more aggressive than you would like, consider using
  [load throttling options](options.md).  These are usually more effective
  than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
-  certainly use those too).
+  certainly use those too) because they limit work that bees queues up
+  for later execution inside btrfs.

 ### Check `$BEESSTATUS`

@@ -52,10 +49,6 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li

 Thread names of note:

- * `crawl_12345`: scan/dedupe worker threads (the number is the subvol
-   ID which the thread is currently working on).  These threads appear
-   and disappear from the status dynamically according to the requirements
-   of the work queue and loadavg throttling.
 * `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
 * `crawl_master`: task that finds new extents in the filesystem and populates the work queue
 * `crawl_transid`: btrfs transid (generation number) tracker and polling thread
@@ -64,6 +57,13 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
 * `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
 * `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly

+Most other threads have names that are derived from the current dedupe
+task that they are executing:
+
+ * `ref_205ad76b1000_24K_50`:  extent scan performing dedupe of btrfs extent bytenr `205ad76b1000`, which is 24 KiB long and has 50 references
+ * `extent_250_32M_16E`:  extent scan searching for extents between 32 MiB + 1 and 16 EiB bytes long, tracking scan position in virtual subvol `250`.
+ * `crawl_378_18916`:  subvol scan searching for extent refs in subvol `378`, inode `18916`.
+
 ### Dump kernel stacks of hung processes

 Check the kernel stacks of all blocked kernel processes:
@@ -91,7 +91,7 @@ bees Crashes
        (gdb) thread apply all bt full

  The last line generates megabytes of output and will often crash gdb.
-  This is OK, submit whatever output gdb can produce.
+  Submit whatever output gdb can produce.

  **Note that this output may include filenames or data from your
  filesystem.**
@@ -160,8 +160,7 @@ Kernel crashes, corruption, and filesystem damage
 -------------------------------------------------

 bees doesn't do anything that _should_ cause corruption or data loss;
-however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
-with some Linux block device layers](btrfs-other.md), so corruption is
+however, [btrfs has kernel bugs](btrfs-kernel.md), so corruption is
 not impossible.

 Issues with the btrfs filesystem kernel code or other block device layers
@@ -64,11 +64,13 @@ namespace crucible {
 		/// @{ Extent items (EXTENT_ITEM)
 		uint64_t extent_begin() const;
 		uint64_t extent_end() const;
+		uint64_t extent_flags() const;
 		uint64_t extent_generation() const;
 		/// @}

 		/// @{ Root items
 		uint64_t root_flags() const;
+		uint64_t root_refs() const;
 		/// @}

 		/// @{ Root backref items.
@@ -108,7 +110,9 @@ namespace crucible {
 		virtual ~BtrfsTreeFetcher() = default;
 		BtrfsTreeFetcher(Fd new_fd);
 		void type(uint8_t type);
+		uint8_t type();
 		void tree(uint64_t tree);
+		uint64_t tree();
 		void transid(uint64_t min_transid, uint64_t max_transid = numeric_limits<uint64_t>::max());
 		/// Block size (sectorsize) of filesystem
 		uint64_t block_size() const;
@@ -169,34 +173,42 @@ namespace crucible {
 		void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
 	};

-	/// Fetch extent items from extent tree
+	/// Fetch extent items from extent tree.
+	/// Does not filter out metadata!  See BtrfsDataExtentTreeFetcher for that.
 	class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsExtentItemFetcher(const Fd &fd);
 	};

-	/// Fetch extent refs from an inode
+	/// Fetch extent refs from an inode.  Caller must set the tree and objectid.
 	class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
 	public:
 		BtrfsExtentDataFetcher(const Fd &fd);
 	};

-	/// Fetch inodes from a subvol
-	class BtrfsFsTreeFetcher : public BtrfsTreeObjectFetcher {
-	public:
-		BtrfsFsTreeFetcher(const Fd &fd, uint64_t subvol);
-	};
-
+	/// Fetch raw inode items
 	class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsInodeFetcher(const Fd &fd);
 		BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
 	};

+	/// Fetch a root (subvol) item
 	class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsRootFetcher(const Fd &fd);
 		BtrfsTreeItem root(uint64_t subvol);
+		BtrfsTreeItem root_backref(uint64_t subvol);
+	};
+
+	/// Fetch data extent items from extent tree, skipping metadata-only block groups
+	class BtrfsDataExtentTreeFetcher : public BtrfsExtentItemFetcher {
+		BtrfsTreeItem		m_current_bg;
+		BtrfsTreeOffsetFetcher	m_chunk_tree;
+	protected:
+		virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr) override;
+	public:
+		BtrfsDataExtentTreeFetcher(const Fd &fd);
 	};

 }
@@ -78,9 +78,6 @@ enum btrfs_compression_type {
 	#define BTRFS_SHARED_BLOCK_REF_KEY      182
 	#define BTRFS_SHARED_DATA_REF_KEY       184
 	#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
-	#define BTRFS_FREE_SPACE_INFO_KEY 198
-	#define BTRFS_FREE_SPACE_EXTENT_KEY 199
-	#define BTRFS_FREE_SPACE_BITMAP_KEY 200
 	#define BTRFS_DEV_EXTENT_KEY    204
 	#define BTRFS_DEV_ITEM_KEY      216
 	#define BTRFS_CHUNK_ITEM_KEY    228
@@ -97,6 +94,18 @@ enum btrfs_compression_type {

 #endif

+#ifndef BTRFS_FREE_SPACE_INFO_KEY
+	#define BTRFS_FREE_SPACE_INFO_KEY 198
+	#define BTRFS_FREE_SPACE_EXTENT_KEY 199
+	#define BTRFS_FREE_SPACE_BITMAP_KEY 200
+	#define BTRFS_FREE_SPACE_OBJECTID -11ULL
+#endif
+
+#ifndef BTRFS_BLOCK_GROUP_RAID1C4
+	#define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
+	#define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
+#endif
+
 #ifndef BTRFS_DEFRAG_RANGE_START_IO

 	// For some reason uapi has BTRFS_DEFRAG_RANGE_COMPRESS and
@@ -55,7 +55,6 @@ namespace crucible {
 		Pointer m_ptr;
 		size_t m_size = 0;
 		mutable mutex m_mutex;
-	friend ostream & operator<<(ostream &os, const ByteVector &bv);
 	};

 	template <class T>
@@ -74,6 +73,8 @@ namespace crucible {
 		THROW_CHECK2(out_of_range, size(), sizeof(T), size() >= sizeof(T));
 		return reinterpret_cast<T*>(data());
 	}
+
+	ostream& operator<<(ostream &os, const ByteVector &bv);
 }

 #endif // _CRUCIBLE_BYTEVECTOR_H_
@@ -197,11 +197,17 @@ namespace crucible {

 		size_t m_buf_size;
 		set<BtrfsIoctlSearchHeader> m_result;
+
+		static thread_local size_t s_calls;
+		static thread_local size_t s_loops;
+		static thread_local size_t s_loops_empty;
+		static thread_local shared_ptr<ostream> s_debug_ostream;
 	};

 	ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
 	ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);

+	string btrfs_chunk_type_ntoa(uint64_t type);
 	string btrfs_search_type_ntoa(unsigned type);
 	string btrfs_search_objectid_ntoa(uint64_t objectid);
 	string btrfs_compress_type_ntoa(uint8_t type);
@@ -239,14 +245,14 @@ namespace crucible {
 		unsigned long available() const;
 	};

-	template<class V> ostream &hexdump(ostream &os, const V &v);
-
 	struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args_v3 {
 		BtrfsIoctlFsInfoArgs();
 		void do_ioctl(int fd);
+		bool do_ioctl_nothrow(int fd);
 		uint16_t csum_type() const;
 		uint16_t csum_size() const;
 		uint64_t generation() const;
+		vector<uint8_t> fsid() const;
 	};

 	ostream & operator<<(ostream &os, const BtrfsIoctlFsInfoArgs &a);
@@ -12,12 +12,14 @@ namespace crucible {
 	ostream &
 	hexdump(ostream &os, const V &v)
 	{
-		os << "V { size = " << v.size() << ", data:\n";
-		for (size_t i = 0; i < v.size(); i += 8) {
+		const auto v_size = v.size();
+		const uint8_t* const v_data = reinterpret_cast<const uint8_t*>(v.data());
+		os << "V { size = " << v_size << ", data:\n";
+		for (size_t i = 0; i < v_size; i += 8) {
 			string hex, ascii;
 			for (size_t j = i; j < i + 8; ++j) {
-				if (j < v.size()) {
-					uint8_t c = v[j];
+				if (j < v_size) {
+					const uint8_t c = v_data[j];
 					char buf[8];
 					sprintf(buf, "%02x ", c);
 					hex += buf;
@@ -117,7 +117,7 @@ namespace crucible {
 		while (full() || locked(name)) {
 			m_condvar.wait(lock);
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK0(runtime_error, rv.second);
 	}

@@ -129,7 +129,7 @@ namespace crucible {
 		if (full() || locked(name)) {
 			return false;
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK1(runtime_error, name, rv.second);
 		return true;
 	}
@@ -14,6 +14,7 @@ namespace crucible {
 		mutex m_mutex;
 		condition_variable m_cv;
 		map<string, size_t> m_counters;
+		bool m_do_locking = true;

 		class LockHandle {
 			const string m_type;
@@ -33,6 +34,7 @@ namespace crucible {
 		shared_ptr<LockHandle> get_lock_private(const string &type);
 	public:
 		static shared_ptr<LockHandle> get_lock(const string &type);
+		static void enable_locking(bool enabled);
 	};

 }
@@ -0,0 +1,52 @@
+#ifndef CRUCIBLE_OPENAT2_H
+#define CRUCIBLE_OPENAT2_H
+
+#include <cstdlib>
+
+// Compatibility for building on old libc for new kernel
+#include <linux/version.h>
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
+
+#include <linux/openat2.h>
+
+#else
+
+#include <linux/types.h>
+
+#ifndef RESOLVE_NO_XDEV
+#define RESOLVE_NO_XDEV 1
+
+// RESOLVE_NO_XDEV was there from the beginning of openat2,
+// so if that's missing, so is open_how
+
+struct open_how {
+	__u64 flags;
+	__u64 mode;
+	__u64 resolve;
+};
+#endif
+
+#ifndef RESOLVE_NO_MAGICLINKS
+#define RESOLVE_NO_MAGICLINKS 2
+#endif
+#ifndef RESOLVE_NO_SYMLINKS
+#define RESOLVE_NO_SYMLINKS 4
+#endif
+#ifndef RESOLVE_BENEATH
+#define RESOLVE_BENEATH 8
+#endif
+#ifndef RESOLVE_IN_ROOT
+#define RESOLVE_IN_ROOT 16
+#endif
+
+#endif // Linux version >= v5.6
+
+extern "C" {
+
+/// Weak symbol to support libc with no syscall wrapper
+int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size) throw();
+
+};
+
+#endif // CRUCIBLE_OPENAT2_H
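
The header above only declares `openat2`; its definition is not part of this section. As a rough sketch of how a fallback for a libc without the wrapper might be written, the code below forwards the call through `syscall(2)`. The names `compat_open_how` and `compat_openat2` are hypothetical, chosen to avoid clashing with any real declaration, and the fallback syscall number 437 is an assumption that holds on most (not all) architectures.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical stand-in for struct open_how, so the sketch builds even
// when <linux/openat2.h> is not installed.
struct compat_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

#ifndef SYS_openat2
// Assumption: openat2 is syscall 437 on most architectures since v5.6.
#define SYS_openat2 437
#endif

// Hypothetical wrapper: forwards to the raw syscall.
// Returns a file descriptor, or -1 with errno set (ENOSYS on old kernels).
static int
compat_openat2(int dirfd, const char *pathname, compat_open_how *how, size_t size)
{
	assert(pathname != nullptr);
	assert(how != nullptr);
	return static_cast<int>(syscall(SYS_openat2, dirfd, pathname, how, size));
}
```

A caller should be prepared for `ENOSYS` on kernels older than v5.6 and fall back to plain `openat` in that case.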
@@ -10,6 +10,10 @@
 #include <sys/wait.h>
 #include <unistd.h>

+extern "C" {
+	pid_t gettid() throw();
+};
+
 namespace crucible {
 	using namespace std;

@@ -73,7 +77,6 @@ namespace crucible {

 	typedef ResourceHandle<Process::id, Process> Pid;

-	pid_t gettid();
 	double getloadavg1();
 	double getloadavg5();
 	double getloadavg15();
@@ -6,23 +6,23 @@
 #include <algorithm>
 #include <limits>

-#include <cstdint>
-
-#if 1
+// Debug stream
+#include <memory>
 #include <iostream>
 #include <sstream>
-#define DINIT(__x) __x
-#define DLOG(__x) do { logs << __x << std::endl; } while (false)
-#define DOUT(__err) do { __err << logs.str(); } while (false)
-#else
-#define DINIT(__x) do {} while (false)
-#define DLOG(__x) do {} while (false)
-#define DOUT(__x) do {} while (false)
-#endif
+
+#include <cstdint>

 namespace crucible {
 	using namespace std;

+	extern thread_local shared_ptr<ostream> tl_seeker_debug_str;
+	#define SEEKER_DEBUG_LOG(__x) do { \
+		if (tl_seeker_debug_str) { \
+			(*tl_seeker_debug_str) << __x << "\n"; \
+		} \
+	} while (false)
+
 	// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
 	// - fetches objects in Pos order, starting from lower (must be >= lower)
 	// - must return upper if present, may or may not return objects after that
@@ -49,113 +49,108 @@ namespace crucible {
 	Pos
 	seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
 	{
-		DINIT(ostringstream logs);
-		try {
-			static const Pos end_pos = numeric_limits<Pos>::max();
-			// TBH this probably won't work if begin_pos != 0, i.e. any signed type
-			static const Pos begin_pos = numeric_limits<Pos>::min();
-			// Run a binary search looking for the highest key below target_pos.
-			// Initial upper bound of the search is target_pos.
-			// Find initial lower bound by doubling the size of the range until a key below target_pos
-			// is found, or the lower bound reaches the beginning of the search space.
-			// If the lower bound search reaches the beginning of the search space without finding a key,
-			// return the beginning of the search space; otherwise, perform a binary search between
-			// the bounds now established.
-			Pos lower_bound = 0;
-			Pos upper_bound = target_pos;
-			bool found_low = false;
-			Pos probe_pos = target_pos;
-			// We need one loop for each bit of the search space to find the lower bound,
-			// one loop for each bit of the search space to find the upper bound,
-			// and one extra loop to confirm the boundary is correct.
-			for (size_t loop_count = min(numeric_limits<Pos>::digits * size_t(2) + 1, max_loops); loop_count; --loop_count) {
-				DLOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
-				auto result = fetch(probe_pos, target_pos);
-				const Pos low_pos = result.empty() ? end_pos : *result.begin();
-				const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
-				DLOG(" = " << low_pos << ".." << high_pos);
-				// check for correct behavior of the fetch function
-				THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
-				THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
-				THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
-				if (!found_low) {
-					// if target_pos == end_pos then we will find it in every empty result set,
-					// so in that case we force the lower bound to be lower than end_pos
-					if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
-						// found a lower bound, set the low bound there and switch to binary search
-						found_low = true;
-						lower_bound = low_pos;
-						DLOG("found_low = true, lower_bound = " << lower_bound);
-					} else {
-						// still looking for lower bound
-						// if probe_pos was begin_pos then we can stop with no result
-						if (probe_pos == begin_pos) {
-							DLOG("return: probe_pos == begin_pos " << begin_pos);
-							return begin_pos;
-						}
-						// double the range size, or use the distance between objects found so far
-						THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
-						// already checked low_pos <= high_pos above
-						const Pos want_delta = max(upper_bound - probe_pos, min_step);
-						// avoid underflowing the beginning of the search space
-						const Pos have_delta = min(want_delta, probe_pos - begin_pos);
-						THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
-						// move probe and try again
-						probe_pos = probe_pos - have_delta;
-						DLOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
-						continue;
+		static const Pos end_pos = numeric_limits<Pos>::max();
+		// TBH this probably won't work if begin_pos != 0, i.e. any signed type
+		static const Pos begin_pos = numeric_limits<Pos>::min();
+		// Run a binary search looking for the highest key below target_pos.
+		// Initial upper bound of the search is target_pos.
+		// Find initial lower bound by doubling the size of the range until a key below target_pos
+		// is found, or the lower bound reaches the beginning of the search space.
+		// If the lower bound search reaches the beginning of the search space without finding a key,
+		// return the beginning of the search space; otherwise, perform a binary search between
+		// the bounds now established.
+		Pos lower_bound = 0;
+		Pos upper_bound = target_pos;
+		bool found_low = false;
+		Pos probe_pos = target_pos;
+		// We need one loop for each bit of the search space to find the lower bound,
+		// one loop for each bit of the search space to find the upper bound,
+		// and one extra loop to confirm the boundary is correct.
+		for (size_t loop_count = min((1 + numeric_limits<Pos>::digits) * size_t(2), max_loops); loop_count; --loop_count) {
+			SEEKER_DEBUG_LOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
+			auto result = fetch(probe_pos, target_pos);
+			const Pos low_pos = result.empty() ? end_pos : *result.begin();
+			const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
+			SEEKER_DEBUG_LOG(" = " << low_pos << ".." << high_pos);
+			// check for correct behavior of the fetch function
+			THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
+			THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
+			THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
+			if (!found_low) {
+				// if target_pos == end_pos then we will find it in every empty result set,
+				// so in that case we force the lower bound to be lower than end_pos
+				if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
+					// found a lower bound, set the low bound there and switch to binary search
+					found_low = true;
+					lower_bound = low_pos;
+					SEEKER_DEBUG_LOG("found_low = true, lower_bound = " << lower_bound);
+				} else {
+					// still looking for lower bound
+					// if probe_pos was begin_pos then we can stop with no result
+					if (probe_pos == begin_pos) {
+						SEEKER_DEBUG_LOG("return: probe_pos == begin_pos " << begin_pos);
+						return begin_pos;
 					}
+					// double the range size, or use the distance between objects found so far
+					THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+					// already checked low_pos <= high_pos above
+					const Pos want_delta = max(upper_bound - probe_pos, min_step);
+					// avoid underflowing the beginning of the search space
+					const Pos have_delta = min(want_delta, probe_pos - begin_pos);
+					THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
+					// move probe and try again
+					probe_pos = probe_pos - have_delta;
+					SEEKER_DEBUG_LOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
+					continue;
 				}
-				if (low_pos <= target_pos && target_pos <= high_pos) {
-					// have keys on either side of target_pos in result
-					// search from the high end until we find the highest key below target
-					for (auto i = result.rbegin(); i != result.rend(); ++i) {
-						// more correctness checking for fetch
-						THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
-						if (*i <= target_pos) {
-							DLOG("return: *i " << *i << " <= target_pos " << target_pos);
-							return *i;
-						}
-					}
-					// if the list is empty then low_pos = high_pos = end_pos
-					// if target_pos = end_pos also, then we will execute the loop
-					// above but not find any matching entries.
-					THROW_CHECK0(runtime_error, result.empty());
-				}
-				if (target_pos <= low_pos) {
-					// results are all too high, so probe_pos..low_pos is too high
-					// lower the high bound to the probe pos
-					upper_bound = probe_pos;
-					DLOG("upper_bound = probe_pos " << probe_pos);
-				}
-				if (high_pos < target_pos) {
-					// results are all too low, so probe_pos..high_pos is too low
-					// raise the low bound to the high_pos
-					DLOG("lower_bound = high_pos " << high_pos);
-					lower_bound = high_pos;
-				}
-				// compute a new probe pos at the middle of the range and try again
-				// we can't have a zero-size range here because we would not have set found_low yet
-				THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
-				const Pos delta = (upper_bound - lower_bound) / 2;
-				probe_pos = lower_bound + delta;
-				if (delta < 1) {
-					// nothing can exist in the range (lower_bound, upper_bound)
-					// and an object is known to exist at lower_bound
-					DLOG("return: probe_pos == lower_bound " << lower_bound);
-					return lower_bound;
-				}
-				THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
-				THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
-				DLOG("loop: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
 			}
-			THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
-				"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
-				"found_low " << found_low);
-		} catch (...) {
-			DOUT(cerr);
-			throw;
+			if (low_pos <= target_pos && target_pos <= high_pos) {
+				// have keys on either side of target_pos in result
+				// search from the high end until we find the highest key below target
+				for (auto i = result.rbegin(); i != result.rend(); ++i) {
+					// more correctness checking for fetch
+					THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
+					if (*i <= target_pos) {
+						SEEKER_DEBUG_LOG("return: *i " << *i << " <= target_pos " << target_pos);
+						return *i;
+					}
+				}
+				// if the list is empty then low_pos = high_pos = end_pos
+				// if target_pos = end_pos also, then we will execute the loop
+				// above but not find any matching entries.
+				THROW_CHECK0(runtime_error, result.empty());
+			}
+			if (target_pos <= low_pos) {
+				// results are all too high, so probe_pos..low_pos is too high
+				// lower the high bound to the probe pos, low_pos cannot be lower
+				SEEKER_DEBUG_LOG("upper_bound = probe_pos " << probe_pos);
+				upper_bound = probe_pos;
+			}
+			if (high_pos < target_pos) {
+				// results are all too low, so probe_pos..high_pos is too low
+				// raise the low bound to high_pos but not above upper_bound
+				const auto next_pos = min(high_pos, upper_bound);
+				SEEKER_DEBUG_LOG("lower_bound = next_pos " << next_pos);
+				lower_bound = next_pos;
+			}
+			// compute a new probe pos at the middle of the range and try again
+			// we can't have a zero-size range here because we would not have set found_low yet
+			THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
+			const Pos delta = (upper_bound - lower_bound) / 2;
+			probe_pos = lower_bound + delta;
+			if (delta < 1) {
+				// nothing can exist in the range (lower_bound, upper_bound)
+				// and an object is known to exist at lower_bound
+				SEEKER_DEBUG_LOG("return: probe_pos == lower_bound " << lower_bound);
+				return lower_bound;
+			}
+			THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
+			THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+			SEEKER_DEBUG_LOG("loop bottom: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
 		}
+		THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
+			"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
+			"found_low " << found_low);
 	}
 }
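
To make the control flow of `seek_backward` easier to follow, here is a condensed, non-template restatement of the loop above, with the `THROW_CHECK` assertions and debug logging removed. It is only a sketch: `Pos` is fixed to `uint64_t`, and `make_limited_fetch` is a hypothetical helper that imitates a search returning at most a few keys per call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <set>
#include <stdexcept>

using Pos = uint64_t;
using Fetch = std::function<std::set<Pos>(Pos lower, Pos upper)>;

// Returns the highest key <= target_pos, or 0 if there is none.
Pos
seek_backward_sketch(const Pos target_pos, const Fetch &fetch, const Pos min_step = 1)
{
	const Pos end_pos = std::numeric_limits<Pos>::max();
	Pos lower_bound = 0;
	Pos upper_bound = target_pos;
	Pos probe_pos = target_pos;
	bool found_low = false;
	const size_t max_loops = (1 + std::numeric_limits<Pos>::digits) * size_t(2);
	for (size_t loop_count = max_loops; loop_count; --loop_count) {
		const auto result = fetch(probe_pos, target_pos);
		const Pos low_pos = result.empty() ? end_pos : *result.begin();
		const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
		if (!found_low) {
			// Phase 1: step backward, doubling the distance each time
			if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
				found_low = true;
				lower_bound = low_pos;
			} else {
				if (probe_pos == 0) return 0;
				const Pos want_delta = std::max(upper_bound - probe_pos, min_step);
				probe_pos -= std::min(want_delta, probe_pos);
				continue;
			}
		}
		// Phase 2: bisect between a known key and the target
		if (low_pos <= target_pos && target_pos <= high_pos) {
			for (auto i = result.rbegin(); i != result.rend(); ++i) {
				if (*i <= target_pos) return *i;
			}
		}
		if (target_pos <= low_pos) upper_bound = probe_pos;
		if (high_pos < target_pos) lower_bound = std::min(high_pos, upper_bound);
		assert(lower_bound <= upper_bound);
		const Pos delta = (upper_bound - lower_bound) / 2;
		probe_pos = lower_bound + delta;
		if (delta < 1) return lower_bound;
	}
	throw std::runtime_error("seek_backward_sketch: loop limit reached");
}

// Hypothetical fetch: returns at most `limit` keys >= lower, in order.
Fetch
make_limited_fetch(const std::set<Pos> &keys, const size_t limit)
{
	return [keys, limit](Pos lower, Pos) {
		std::set<Pos> rv;
		for (auto it = keys.lower_bound(lower); it != keys.end() && rv.size() < limit; ++it) {
			rv.insert(*it);
		}
		return rv;
	};
}
```

With keys `{10, 20, 30, 1000}` and a fetch limited to two keys per call, a search for 500 first doubles its way back to position 0, then bisects forward until a fetch spans the target.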

@@ -0,0 +1,106 @@
+#ifndef CRUCIBLE_TABLE_H
+#define CRUCIBLE_TABLE_H
+
+#include <functional>
+#include <limits>
+#include <map>
+#include <memory>
+#include <ostream>
+#include <sstream>
+#include <string>
+#include <vector>
+
+namespace crucible {
+	namespace Table {
+		using namespace std;
+
+		using Content = function<string(size_t width, size_t height)>;
+		const size_t endpos = numeric_limits<size_t>::max();
+
+		Content Fill(const char c);
+		Content Text(const string& s);
+
+		template <class T>
+		Content Number(const T& num)
+		{
+			ostringstream oss;
+			oss << num;
+			return Text(oss.str());
+		}
+
+		class Cell {
+			Content m_content;
+		public:
+			Cell(const Content &fn = [](size_t, size_t) { return string(); } );
+			Cell& operator=(const Content &fn);
+			string text(size_t width, size_t height) const;
+		};
+
+		class Dimension {
+			size_t m_next_pos = 0;
+			vector<size_t> m_elements;
+		friend class Table;
+			size_t at(size_t) const;
+		public:
+			size_t size() const;
+			size_t insert(size_t pos);
+			void erase(size_t pos);
+		};
+
+		class Table {
+			Dimension m_rows, m_cols;
+			map<pair<size_t, size_t>, Cell> m_cells;
+			string m_left = "|";
+			string m_mid = "|";
+			string m_right = "|";
+		public:
+			Dimension &rows();
+			const Dimension& rows() const;
+			Dimension &cols();
+			const Dimension& cols() const;
+			Cell& at(size_t row, size_t col);
+			const Cell& at(size_t row, size_t col) const;
+			template <class T> void insert_row(size_t pos, const T& container);
+			template <class T> void insert_col(size_t pos, const T& container);
+			void left(const string &s);
+			void mid(const string &s);
+			void right(const string &s);
+			const string& left() const;
+			const string& mid() const;
+			const string& right() const;
+		};
+
+		ostream& operator<<(ostream &os, const Table &table);
+
+		template <class T>
+		void
+		Table::insert_row(size_t pos, const T& container)
+		{
+			const auto new_pos = m_rows.insert(pos);
+			size_t col = 0;
+			for (const auto &i : container) {
+				if (col >= cols().size()) {
+					cols().insert(col);
+				}
+				at(new_pos, col++) = i;
+			}
+		}
+
+		template <class T>
+		void
+		Table::insert_col(size_t pos, const T& container)
+		{
+			const auto new_pos = m_cols.insert(pos);
+			size_t row = 0;
+			for (const auto &i : container) {
+				if (row >= rows().size()) {
+					rows().insert(row);
+				}
+				at(row++, new_pos) = i;
+			}
+		}
+
+	}
+}
+
+#endif // CRUCIBLE_TABLE_H
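
The header declares `Fill` and `Text` but their definitions are not part of this section. The sketch below shows one way a `Content` callback could work: rendering is deferred until the caller knows the cell width. The bodies here are assumptions for illustration only (single-line cells, left-justified text), placed in a hypothetical `sketch` namespace, and may differ from the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>

namespace sketch {
	using Content = std::function<std::string(size_t width, size_t height)>;

	// Assumed behaviour: repeat one character across the whole cell width.
	Content Fill(const char c)
	{
		return [c](size_t width, size_t) { return std::string(width, c); };
	}

	// Assumed behaviour: left-justify the text and pad with spaces.
	// Text longer than the cell is returned unchanged.
	Content Text(const std::string &s)
	{
		return [s](size_t width, size_t) {
			std::string rv = s;
			if (rv.size() < width) {
				rv.append(width - rv.size(), ' ');
			}
			return rv;
		};
	}

	// Same shape as the template in the header: format, then delegate to Text.
	template <class T>
	Content Number(const T &num)
	{
		std::ostringstream oss;
		oss << num;
		return Text(oss.str());
	}
}
```

Deferring the rendering this way means a `Fill('-')` cell can stretch to whatever width the widest cell in its column turns out to need.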
@@ -40,10 +40,17 @@ namespace crucible {
 		/// after the current instance exits.
 		void run() const;

+		/// Schedule task to run when no other Task is available.
+		void idle() const;
+
 		/// Schedule Task to run after this Task has run or
 		/// been destroyed.
 		void append(const Task &task) const;

+		/// Schedule Task to run after this Task has run or
+		/// been destroyed, in Task ID order.
+		void insert(const Task &task) const;
+
 		/// Describe Task as text.
 		string title() const;

@@ -163,15 +170,12 @@ namespace crucible {
 		/// (it is the ExclusionLock that owns the lock, so it can
 		/// be passed to other Tasks or threads, but this is not
 		/// recommended practice).
-		/// If not successful, current Task is appended to the
+		/// If not successful, the argument Task is appended to the
 		/// task that currently holds the lock.  Current task is
-		/// expected to release any other ExclusionLock
+		/// expected to immediately release any other ExclusionLock
 		/// objects it holds, and exit its Task function.
 		ExclusionLock try_lock(const Task &task);

-		/// Execute Task when Exclusion is unlocked (possibly
-		/// immediately).
-		void insert_task(const Task &t);
 	};

 	/// Wrapper around pthread_setname_np which handles length limits
@@ -34,7 +34,7 @@ namespace crucible {
 		double	m_rate;
 		double	m_burst;
 		double  m_tokens = 0.0;
-		mutex	m_mutex;
+		mutable mutex m_mutex;

 		void update_tokens();
 		RateLimiter() = delete;
@@ -45,6 +45,8 @@ namespace crucible {
 		double sleep_time(double cost = 1.0);
 		bool is_ready();
 		void borrow(double cost = 1.0);
+		void rate(double new_rate);
+		double rate() const;
 	};

 	class RateEstimator {
@@ -88,6 +90,9 @@ namespace crucible {
 		// Read count
 		uint64_t count() const;

+		/// Increment count (like update(count() + more), but atomic)
+		void increment(uint64_t more = 1);
+
 		// Convert counts to chrono types
 		chrono::high_resolution_clock::time_point time_point(uint64_t absolute_count) const;
 		chrono::duration<double> duration(uint64_t relative_count) const;
@@ -14,9 +14,12 @@ CRUCIBLE_OBJS = \
 	fs.o \
 	multilock.o \
 	ntoa.o \
+	openat2.o \
 	path.o \
 	process.o \
+	seeker.o \
 	string.o \
+	table.o \
 	task.o \
 	time.o \
 	uname.o \
@@ -5,6 +5,12 @@
 #include "crucible/hexdump.h"
 #include "crucible/seeker.h"

+#define CRUCIBLE_BTRFS_TREE_DEBUG(x) do { \
+	if (BtrfsIoctlSearchKey::s_debug_ostream) { \
+		(*BtrfsIoctlSearchKey::s_debug_ostream) << x; \
+	} \
+} while (false)
+
 namespace crucible {
 	using namespace std;

@@ -22,6 +28,13 @@ namespace crucible {
 		return m_objectid + m_offset;
 	}

+	uint64_t
+	BtrfsTreeItem::extent_flags() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
+		return btrfs_get_member(&btrfs_extent_item::flags, m_data);
+	}
+
 	uint64_t
 	BtrfsTreeItem::extent_generation() const
 	{
@@ -61,6 +74,13 @@ namespace crucible {
 		return btrfs_get_member(&btrfs_root_item::flags, m_data);
 	}

+	uint64_t
+	BtrfsTreeItem::root_refs() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_ITEM_KEY);
+		return btrfs_get_member(&btrfs_root_item::refs, m_data);
+	}
+
 	ostream &
 	operator<<(ostream &os, const BtrfsTreeItem &bti)
 	{
@@ -269,12 +289,24 @@ namespace crucible {
 		m_type = type;
 	}

+	uint8_t
+	BtrfsTreeFetcher::type()
+	{
+		return m_type;
+	}
+
 	void
 	BtrfsTreeFetcher::tree(uint64_t tree)
 	{
 		m_tree = tree;
 	}

+	uint64_t
+	BtrfsTreeFetcher::tree()
+	{
+		return m_tree;
+	}
+
 	void
 	BtrfsTreeFetcher::transid(uint64_t min_transid, uint64_t max_transid)
 	{
@@ -329,6 +361,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::at(uint64_t logical)
 	{
+		CRUCIBLE_BTRFS_TREE_DEBUG("at " << logical);
 		BtrfsIoctlSearchKey &sk = m_sk;
 		fill_sk(sk, logical);
 		// Exact match, should return 0 or 1 items
@@ -371,53 +404,59 @@ namespace crucible {
 	BtrfsTreeFetcher::rlower_bound(uint64_t logical)
 	{
 	#if 0
-	#define BTFRLB_DEBUG(x) do { cerr << x; } while (false)
+		static bool btfrlb_debug = getenv("BTFRLB_DEBUG");
+	#define BTFRLB_DEBUG(x) do { if (btfrlb_debug) cerr << x; } while (false)
 	#else
-	#define BTFRLB_DEBUG(x) do { } while (false)
+	#define BTFRLB_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
 	#endif
 		BtrfsTreeItem closest_item;
 		uint64_t closest_logical = 0;
 		BtrfsIoctlSearchKey &sk = m_sk;
 		size_t loops = 0;
-		BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << endl);
-		seek_backward(scale_logical(logical), [&](uint64_t lower_bound, uint64_t upper_bound) {
+		BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << " in tree " << tree() << endl);
+		seek_backward(scale_logical(logical), [&](uint64_t const lower_bound, uint64_t const upper_bound) {
 			++loops;
 			fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
 			set<uint64_t> rv;
+			bool too_far = false;
 			do {
 				sk.nr_items = 4;
 				sk.do_ioctl(fd());
 				BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
 				for (auto &i : sk.m_result) {
 					next_sk(sk, i);
-					const auto this_logical = hdr_logical(i);
-					const auto scaled_hdr_logical = scale_logical(this_logical);
-					BTFRLB_DEBUG(" " << to_hex(scaled_hdr_logical));
-					if (hdr_match(i)) {
-						if (this_logical <= logical && this_logical > closest_logical) {
-							closest_logical = this_logical;
-							closest_item = i;
-						}
-						BTFRLB_DEBUG("(match)");
-						rv.insert(scaled_hdr_logical);
-					}
-					if (scaled_hdr_logical > upper_bound || hdr_stop(i)) {
-						if (scaled_hdr_logical >= upper_bound) {
-							BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
-						}
-						if (hdr_stop(i)) {
-							rv.insert(numeric_limits<uint64_t>::max());
-							BTFRLB_DEBUG("(stop)");
-						}
+					// If hdr_stop or !hdr_match, don't inspect the item
+					if (hdr_stop(i)) {
+						too_far = true;
+						rv.insert(numeric_limits<uint64_t>::max());
+						BTFRLB_DEBUG("(stop)");
 						break;
-					} else {
-						BTFRLB_DEBUG("(cont'd)");
 					}
+					if (!hdr_match(i)) {
+						BTFRLB_DEBUG("(no match)");
+						continue;
+					}
+					const auto this_logical = hdr_logical(i);
+					BTFRLB_DEBUG(" " << to_hex(this_logical) << " " << i);
+					const auto scaled_hdr_logical = scale_logical(this_logical);
+					BTFRLB_DEBUG(" " << "(match)");
+					if (scaled_hdr_logical > upper_bound) {
+						too_far = true;
+						BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " > " << to_hex(upper_bound) << ")");
+						break;
+					}
+					if (this_logical <= logical && this_logical > closest_logical) {
+						closest_logical = this_logical;
+						closest_item = i;
+						BTFRLB_DEBUG("(closest)");
+					}
+					rv.insert(scaled_hdr_logical);
+					BTFRLB_DEBUG("(cont'd)");
 				}
 				BTFRLB_DEBUG(endl);
 				// We might get a search result that contains only non-matching items.
 				// Keep looping until we find any matching item or we run out of tree.
-			} while (rv.empty() && !sk.m_result.empty());
+			} while (!too_far && rv.empty() && !sk.m_result.empty());
 			return rv;
 		}, scale_logical(lookbehind_size()));
 		return closest_item;
@@ -448,6 +487,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::next(uint64_t logical)
 	{
+		CRUCIBLE_BTRFS_TREE_DEBUG("next " << logical);
 		const auto scaled_logical = scale_logical(logical);
 		if (scaled_logical + 1 > scaled_max_logical()) {
 			return BtrfsTreeItem();
@@ -458,6 +498,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::prev(uint64_t logical)
 	{
+		CRUCIBLE_BTRFS_TREE_DEBUG("prev " << logical);
 		const auto scaled_logical = scale_logical(logical);
 		if (scaled_logical < 1) {
 			return BtrfsTreeItem();
@@ -542,9 +583,10 @@ namespace crucible {
 	BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
 	{
 	#if 0
-	#define BCTFGS_DEBUG(x) do { cerr << x; } while (false)
+		static bool bctfgs_debug = getenv("BCTFGS_DEBUG");
+	#define BCTFGS_DEBUG(x) do { if (bctfgs_debug) cerr << x; } while (false)
 	#else
-	#define BCTFGS_DEBUG(x) do { } while (false)
+	#define BCTFGS_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
 	#endif
 		const uint64_t logical_end = logical + count * block_size();
 		BtrfsTreeItem bti = rlower_bound(logical);
@@ -636,14 +678,6 @@ namespace crucible {
 		type(BTRFS_EXTENT_DATA_KEY);
 	}

-	BtrfsFsTreeFetcher::BtrfsFsTreeFetcher(const Fd &new_fd, uint64_t subvol) :
-		BtrfsTreeObjectFetcher(new_fd)
-	{
-		tree(subvol);
-		type(BTRFS_EXTENT_DATA_KEY);
-		scale_size(1);
-	}
-
 	BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
 		BtrfsTreeObjectFetcher(fd)
 	{
@@ -667,18 +701,86 @@ namespace crucible {
 		BtrfsTreeObjectFetcher(fd)
 	{
 		tree(BTRFS_ROOT_TREE_OBJECTID);
-		type(BTRFS_ROOT_ITEM_KEY);
 		scale_size(1);
 	}

 	BtrfsTreeItem
-	BtrfsRootFetcher::root(uint64_t subvol)
+	BtrfsRootFetcher::root(const uint64_t subvol)
 	{
+		const auto my_type = BTRFS_ROOT_ITEM_KEY;
+		type(my_type);
 		const auto item = at(subvol);
 		if (!!item) {
 			THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
-			THROW_CHECK2(runtime_error, item.type(), BTRFS_ROOT_ITEM_KEY, item.type() == BTRFS_ROOT_ITEM_KEY);
+			THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
 		}
 		return item;
 	}
+
+	BtrfsTreeItem
+	BtrfsRootFetcher::root_backref(const uint64_t subvol)
+	{
+		const auto my_type = BTRFS_ROOT_BACKREF_KEY;
+		type(my_type);
+		const auto item = at(subvol);
+		if (!!item) {
+			THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
+			THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
+		}
+		return item;
+	}
+
+	BtrfsDataExtentTreeFetcher::BtrfsDataExtentTreeFetcher(const Fd &fd) :
+		BtrfsExtentItemFetcher(fd),
+		m_chunk_tree(fd)
+	{
+		tree(BTRFS_EXTENT_TREE_OBJECTID);
+		type(BTRFS_EXTENT_ITEM_KEY);
+		m_chunk_tree.tree(BTRFS_CHUNK_TREE_OBJECTID);
+		m_chunk_tree.type(BTRFS_CHUNK_ITEM_KEY);
+		m_chunk_tree.objectid(BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+	}
+
+	void
+	BtrfsDataExtentTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
+	{
+		key.min_type = key.max_type = type();
+		key.max_objectid = key.max_offset = numeric_limits<uint64_t>::max();
+		key.min_offset = 0;
+		key.min_objectid = hdr.objectid;
+		const auto step = scale_size();
+		if (key.min_objectid < numeric_limits<uint64_t>::max() - step) {
+			key.min_objectid += step;
+		} else {
+			key.min_objectid = numeric_limits<uint64_t>::max();
+		}
+		// If we have a cached block group, check whether the new position is still inside it
+		if (!!m_current_bg) {
+			const auto bg_begin = m_current_bg.offset();
+			const auto bg_end = bg_begin + m_current_bg.chunk_length();
+			// If we are still in our current block group, return early
+			if (key.min_objectid >= bg_begin && key.min_objectid < bg_end) return;
+		}
+		// We don't have a current block group or we're out of range
+		// Find the chunk that this bytenr belongs to
+		m_current_bg = m_chunk_tree.rlower_bound(key.min_objectid);
+		// Make sure it's a data block group
+		while (!!m_current_bg) {
+			// Data block group, stop here
+			if (m_current_bg.chunk_type() & BTRFS_BLOCK_GROUP_DATA) break;
+			// Not a data block group, skip to end
+			key.min_objectid = m_current_bg.offset() + m_current_bg.chunk_length();
+			m_current_bg = m_chunk_tree.lower_bound(key.min_objectid);
+		}
+		if (!m_current_bg) {
+			// Ran out of data block groups, stop here
+			return;
+		}
+		// Check to see if bytenr is in the current data block group
+		const auto bg_begin = m_current_bg.offset();
+		if (key.min_objectid < bg_begin) {
+			// Move forward to start of data block group
+			key.min_objectid = bg_begin;
+		}
+	}
 }
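The stepping logic in `next_sk` above relies on two small pieces of arithmetic: advancing the search position without wrapping past the end of the 64-bit address space, and testing whether a bytenr falls inside a half-open block group range. A minimal sketch of both follows. The helper names `saturating_advance` and `in_block_group` are hypothetical and do not exist in crucible; they only mirror the comparisons used in the hunk.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of crucible): advance pos by step,
// clamping at the largest uint64_t value instead of wrapping around.
inline uint64_t
saturating_advance(const uint64_t pos, const uint64_t step)
{
	const auto limit = std::numeric_limits<uint64_t>::max();
	// Same comparison as next_sk: only add when the sum stays below the limit
	if (pos < limit - step) {
		return pos + step;
	}
	return limit;
}

// Hypothetical helper: is bytenr inside the half-open range
// [begin, begin + length)?
inline bool
in_block_group(const uint64_t bytenr, const uint64_t begin, const uint64_t length)
{
	// Assumption: the block group does not extend past the address space
	assert(length <= std::numeric_limits<uint64_t>::max() - begin);
	return bytenr >= begin && bytenr < begin + length;
}
```

With these, a position already at the maximum stays there, and the end offset of a block group is treated as outside it, which is why the loop in the hunk moves to `offset() + chunk_length()` to skip a non-data block group.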
@@ -44,10 +44,10 @@ namespace crucible {
 	}

 	ByteVector::value_type&
-	ByteVector::operator[](size_t size) const
+	ByteVector::operator[](size_t index) const
 	{
 		unique_lock<mutex> lock(m_mutex);
-		return m_ptr.get()[size];
+		return m_ptr.get()[index];
 	}

 	ByteVector::ByteVector(const ByteVector &that)
@@ -183,7 +183,6 @@ namespace crucible {

 	ostream&
 	operator<<(ostream &os, const ByteVector &bv) {
-		unique_lock<mutex> lock(bv.m_mutex);
 		hexdump(os, bv);
 		return os;
 	}
@@ -76,7 +76,7 @@ namespace crucible {
 			DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));

 			header_stream << buf;
-			header_stream << " " << getpid() << "." << crucible::gettid();
+			header_stream << " " << getpid() << "." << gettid();
 			if (add_prefix_level) {
 				header_stream << "<" << m_loglevel << ">";
 			}
@@ -88,7 +88,7 @@ namespace crucible {
 				header_stream << "<" << m_loglevel << ">";
 			}
 			header_stream << (m_name.empty() ? "thread" : m_name);
-			header_stream << "[" << crucible::gettid() << "]";
+			header_stream << "[" << gettid() << "]";
 		}

 		header_stream << ": ";
@@ -159,12 +159,13 @@ namespace crucible {
 	{
 		THROW_CHECK1(invalid_argument, src_length, src_length > 0);
 		while (src_length > 0) {
-			off_t length = min(off_t(BTRFS_MAX_DEDUPE_LEN), src_length);
-			BtrfsExtentSame bes(src_fd, src_offset, length);
+			BtrfsExtentSame bes(src_fd, src_offset, src_length);
 			bes.add(dst_fd, dst_offset);
 			bes.do_ioctl();
-			auto status = bes.m_info.at(0).status;
+			const auto status = bes.m_info.at(0).status;
 			if (status == 0) {
+				const off_t length = bes.m_info.at(0).bytes_deduped;
+				THROW_CHECK0(invalid_argument, length > 0);
 				src_offset += length;
 				dst_offset += length;
 				src_length -= length;
@@ -333,7 +334,7 @@ namespace crucible {
 		btrfs_ioctl_logical_ino_args args = (btrfs_ioctl_logical_ino_args) {
 			.logical = m_logical,
 			.size = m_container_size,
-			.inodes = reinterpret_cast<uint64_t>(m_container.prepare(m_container_size)),
+			.inodes = reinterpret_cast<uintptr_t>(m_container.prepare(m_container_size)),
 		};
 		// We are still supporting building with old headers that don't have .flags yet
 		*(&args.reserved[0] + 3) = m_flags;
@@ -416,7 +417,7 @@ namespace crucible {
 	{
 		btrfs_ioctl_ino_path_args *p = static_cast<btrfs_ioctl_ino_path_args *>(this);
 		BtrfsDataContainer container(m_container_size);
-		fspath = reinterpret_cast<uint64_t>(container.prepare(m_container_size));
+		fspath = reinterpret_cast<uintptr_t>(container.prepare(m_container_size));
 		size = container.get_size();

 		m_paths.clear();
@@ -753,6 +754,11 @@ namespace crucible {
 		return offset + len;
 	}

+	thread_local size_t BtrfsIoctlSearchKey::s_calls = 0;
+	thread_local size_t BtrfsIoctlSearchKey::s_loops = 0;
+	thread_local size_t BtrfsIoctlSearchKey::s_loops_empty = 0;
+	thread_local shared_ptr<ostream> BtrfsIoctlSearchKey::s_debug_ostream;
+
 	bool
 	BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
 	{
@@ -771,8 +777,17 @@ namespace crucible {
 			ioctl_ptr = ioctl_arg.get<btrfs_ioctl_search_args_v2>();
 			ioctl_ptr->key = static_cast<const btrfs_ioctl_search_key&>(*this);
 			ioctl_ptr->buf_size = buf_size;
+			if (s_debug_ostream) {
+				(*s_debug_ostream) << "bisk " << (ioctl_ptr->key) << "\n";
+			}
 			// Don't bother supporting V1.  Kernels that old have other problems.
 			int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_arg.data());
+			++s_calls;
+			if (rv != 0 && errno == ENOENT) {
+				// If we are searching a tree that is deleted or no longer exists, just return an empty list
+				ioctl_ptr->key.nr_items = 0;
+				break;
+			}
 			if (rv != 0 && errno != EOVERFLOW) {
 				return false;
 			}
@@ -794,6 +809,10 @@ namespace crucible {
 				buf_size *= 2;
 			}
 			// don't automatically raise the buf size higher than 64K, the largest possible btrfs item
+			++s_loops;
+			if (ioctl_ptr->key.nr_items == 0) {
+				++s_loops_empty;
+			}
 		} while (buf_size < 65536);

 		// ioctl changes nr_items, this has to be copied back
@@ -866,6 +885,26 @@ namespace crucible {
 		}
 	}

+	string
+	btrfs_chunk_type_ntoa(uint64_t type)
+	{
+		static const bits_ntoa_table table[] = {
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DATA),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_METADATA),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_SYSTEM),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DUP),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID0),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID10),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C3),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C4),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID5),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID6),
+			NTOA_TABLE_ENTRY_END()
+		};
+		return bits_ntoa(type, table);
+	}
+
 	string
 	btrfs_search_type_ntoa(unsigned type)
 	{
@@ -893,15 +932,9 @@ namespace crucible {
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_BLOCK_REF_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_DATA_REF_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_BLOCK_GROUP_ITEM_KEY),
-#ifdef BTRFS_FREE_SPACE_INFO_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_INFO_KEY),
-#endif
-#ifdef BTRFS_FREE_SPACE_EXTENT_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_EXTENT_KEY),
-#endif
-#ifdef BTRFS_FREE_SPACE_BITMAP_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_BITMAP_KEY),
-#endif
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_EXTENT_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_ITEM_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_CHUNK_ITEM_KEY),
@@ -933,9 +966,7 @@ namespace crucible {
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_CSUM_TREE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_QUOTA_TREE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_UUID_TREE_OBJECTID),
-#ifdef BTRFS_FREE_SPACE_TREE_OBJECTID
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_TREE_OBJECTID),
-#endif
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_BALANCE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_ORPHAN_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_TREE_LOG_OBJECTID),
@@ -1123,11 +1154,17 @@ namespace crucible {
 	{
 	}

-	void
-	BtrfsIoctlFsInfoArgs::do_ioctl(int fd)
+	bool
+	BtrfsIoctlFsInfoArgs::do_ioctl_nothrow(int const fd)
 	{
 		btrfs_ioctl_fs_info_args_v3 *p = static_cast<btrfs_ioctl_fs_info_args_v3 *>(this);
-		if (ioctl(fd, BTRFS_IOC_FS_INFO, p)) {
+		return 0 == ioctl(fd, BTRFS_IOC_FS_INFO, p);
+	}
+
+	void
+	BtrfsIoctlFsInfoArgs::do_ioctl(int const fd)
+	{
+		if (!do_ioctl_nothrow(fd)) {
 			THROW_ERRNO("BTRFS_IOC_FS_INFO: fd " << fd);
 		}
 	}
@@ -1144,6 +1181,13 @@ namespace crucible {
 		return this->btrfs_ioctl_fs_info_args_v3::csum_size;
 	}

+	vector<uint8_t>
+	BtrfsIoctlFsInfoArgs::fsid() const
+	{
+		const auto begin = btrfs_ioctl_fs_info_args_v3::fsid;
+		return vector<uint8_t>(begin, begin + BTRFS_FSID_SIZE);
+	}
+
 	uint64_t
 	BtrfsIoctlFsInfoArgs::generation() const
 	{
@@ -62,11 +62,22 @@ namespace crucible {
 		return rv;
 	}

+	static MultiLocker s_process_instance;
+
 	shared_ptr<MultiLocker::LockHandle>
 	MultiLocker::get_lock(const string &type)
 	{
-		static MultiLocker s_process_instance;
-		return s_process_instance.get_lock_private(type);
+		if (s_process_instance.m_do_locking) {
+			return s_process_instance.get_lock_private(type);
+		} else {
+			return shared_ptr<MultiLocker::LockHandle>();
+		}
+	}
+
+	void
+	MultiLocker::enable_locking(const bool enabled)
+	{
+		s_process_instance.m_do_locking = enabled;
 	}

 }
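The change above makes `get_lock` return an empty `shared_ptr` when locking is disabled, so callers can hold or `reset()` the handle without checking which mode is active. The pattern can be sketched in isolation as follows. `ToggleLock` is a hypothetical stand-in, not the real `MultiLocker`: it only counts live handles rather than limiting concurrency.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for MultiLocker; all names are illustrative only.
class ToggleLock {
	std::mutex m_mutex;
	bool m_enabled = true;
	int m_held = 0;
public:
	// RAII handle: counts itself in on construction, out on destruction
	struct Handle {
		ToggleLock &m_owner;
		explicit Handle(ToggleLock &owner) : m_owner(owner) {
			std::unique_lock<std::mutex> lock(m_owner.m_mutex);
			++m_owner.m_held;
		}
		~Handle() {
			std::unique_lock<std::mutex> lock(m_owner.m_mutex);
			assert(m_owner.m_held > 0);
			--m_owner.m_held;
		}
	};

	// Returns an empty pointer when locking is disabled
	std::shared_ptr<Handle> get() {
		{
			std::unique_lock<std::mutex> lock(m_mutex);
			if (!m_enabled) {
				return std::shared_ptr<Handle>();
			}
		}
		return std::make_shared<Handle>(*this);
	}

	void enable(const bool enabled) {
		std::unique_lock<std::mutex> lock(m_mutex);
		m_enabled = enabled;
	}

	int held() {
		std::unique_lock<std::mutex> lock(m_mutex);
		return m_held;
	}
};
```

Calling `reset()` on an empty `shared_ptr` is a no-op, so code that releases the handle early behaves the same in both modes.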
@@ -0,0 +1,40 @@
+#include "crucible/openat2.h"
+
+#include <sys/syscall.h>
+
+// Compatibility for building on old libc for new kernel
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 6, 0)
+
+// Every arch that defines this uses 437, except Alpha, where 437 is
+// mq_getsetattr.
+
+#ifndef SYS_openat2
+#ifdef __alpha__
+#define SYS_openat2 547
+#else
+#define SYS_openat2 437
+#endif
+#endif
+
+#endif // LINUX_VERSION_CODE < KERNEL_VERSION(5, 6, 0)
+
+#include <fcntl.h>
+#include <unistd.h>
+
+extern "C" {
+
+int
+__attribute__((weak))
+openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
+throw()
+{
+#ifdef SYS_openat2
+	return syscall(SYS_openat2, dirfd, pathname, how, size);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+};
@@ -7,13 +7,18 @@
 #include <cstdlib>
 #include <utility>

-// for gettid()
-#ifndef _GNU_SOURCE
-#define _GNU_SOURCE
-#endif
 #include <unistd.h>
 #include <sys/syscall.h>

+extern "C" {
+	pid_t
+	__attribute__((weak))
+	gettid() throw()
+	{
+		return syscall(SYS_gettid);
+	}
+};
+
 namespace crucible {
 	using namespace std;

@@ -111,12 +116,6 @@ namespace crucible {
 		}
 	}

-	pid_t
-	gettid()
-	{
-		return syscall(SYS_gettid);
-	}
-
 	double
 	getloadavg1()
 	{
@@ -0,0 +1,7 @@
+#include "crucible/seeker.h"
+
+namespace crucible {
+
+	thread_local shared_ptr<ostream> tl_seeker_debug_str;
+
+};
@@ -0,0 +1,254 @@
+#include "crucible/table.h"
+
+#include "crucible/string.h"
+
+namespace crucible {
+	namespace Table {
+		using namespace std;
+
+		Content
+		Fill(const char c)
+		{
+			return [=](size_t width, size_t height) -> string {
+				string rv;
+				while (height--) {
+					rv += string(width, c);
+					if (height) {
+						rv += "\n";
+					}
+				}
+				return rv;
+			};
+		}
+
+		Content
+		Text(const string &s)
+		{
+			return [=](size_t width, size_t height) -> string {
+				const auto lines = split("\n", s);
+				string rv;
+				size_t line_count = 0;
+				for (const auto &i : lines) {
+					if (line_count++) {
+						rv += "\n";
+					}
+					if (i.length() < width) {
+						rv += string(width - i.length(), ' ');
+					}
+					rv += i;
+				}
+				while (line_count < height) {
+					if (line_count++) {
+						rv += "\n";
+					}
+					rv += string(width, ' ');
+				}
+				return rv;
+			};
+		}
+
+		Content
+		Number(const string &s)
+		{
+			return [=](size_t width, size_t height) -> string {
+				const auto lines = split("\n", s);
+				string rv;
+				size_t line_count = 0;
+				for (const auto &i : lines) {
+					if (line_count++) {
+						rv += "\n";
+					}
+					if (i.length() < width) {
+						rv += string(width - i.length(), ' ');
+					}
+					rv += i;
+				}
+				while (line_count < height) {
+					if (line_count++) {
+						rv += "\n";
+					}
+					rv += string(width, ' ');
+				}
+				return rv;
+			};
+		}
+
+		Cell::Cell(const Content &fn) :
+			m_content(fn)
+		{
+		}
+
+		Cell&
+		Cell::operator=(const Content &fn)
+		{
+			m_content = fn;
+			return *this;
+		}
+
+		string
+		Cell::text(size_t width, size_t height) const
+		{
+			return m_content(width, height);
+		}
+
+		size_t
+		Dimension::size() const
+		{
+			return m_elements.size();
+		}
+
+		size_t
+		Dimension::insert(size_t pos)
+		{
+			++m_next_pos;
+			const auto insert_pos = min(m_elements.size(), pos);
+			const auto it = m_elements.begin() + insert_pos;
+			m_elements.insert(it, m_next_pos);
+			return insert_pos;
+		}
+
+		void
+		Dimension::erase(size_t pos)
+		{
+			const auto it = m_elements.begin() + min(m_elements.size(), pos);
+			m_elements.erase(it);
+		}
+
+		size_t
+		Dimension::at(size_t pos) const
+		{
+			return m_elements.at(pos);
+		}
+
+		Dimension&
+		Table::rows()
+		{
+			return m_rows;
+		};
+
+		const Dimension&
+		Table::rows() const
+		{
+			return m_rows;
+		};
+
+		Dimension&
+		Table::cols()
+		{
+			return m_cols;
+		};
+
+		const Dimension&
+		Table::cols() const
+		{
+			return m_cols;
+		};
+
+		const Cell&
+		Table::at(size_t row, size_t col) const
+		{
+			const auto row_idx = m_rows.at(row);
+			const auto col_idx = m_cols.at(col);
+			const auto found = m_cells.find(make_pair(row_idx, col_idx));
+			if (found == m_cells.end()) {
+				static const Cell s_empty(Fill('.'));
+				return s_empty;
+			}
+			return found->second;
+		};
+
+		Cell&
+		Table::at(size_t row, size_t col)
+		{
+			const auto row_idx = m_rows.at(row);
+			const auto col_idx = m_cols.at(col);
+			return m_cells[make_pair(row_idx, col_idx)];
+		};
+
+		static
+		pair<size_t, size_t>
+		text_size(const string &s)
+		{
+			const auto s_split = split("\n", s);
+			size_t width = 0;
+			for (const auto &i : s_split) {
+				width = max(width, i.length());
+			}
+			return make_pair(width, s_split.size());
+		}
+
+		ostream& operator<<(ostream &os, const Table &table)
+		{
+			const auto rows = table.rows().size();
+			const auto cols = table.cols().size();
+			vector<size_t> row_heights(rows, 1);
+			vector<size_t> col_widths(cols, 1);
+			// Get the size of all fixed- and minimum-sized content cells
+			for (size_t row = 0; row < table.rows().size(); ++row) {
+				vector<string> col_text;
+				for (size_t col = 0; col < table.cols().size(); ++col) {
+					col_text.push_back(table.at(row, col).text(0, 0));
+					const auto tsize = text_size(*col_text.rbegin());
+					row_heights[row] = max(row_heights[row], tsize.second);
+					col_widths[col] = max(col_widths[col], tsize.first);
+				}
+			}
+			// Render the table
+			for (size_t row = 0; row < table.rows().size(); ++row) {
+				vector<string> lines(row_heights[row], "");
+				for (size_t col = 0; col < table.cols().size(); ++col) {
+					const auto& table_cell = table.at(row, col);
+					const auto table_text = table_cell.text(col_widths[col], row_heights[row]);
+					auto col_lines = split("\n", table_text);
+					col_lines.resize(row_heights[row], "");
+					for (size_t line = 0; line < row_heights[row]; ++line) {
+						if (col > 0) {
+							lines[line] += table.mid();
+						}
+						lines[line] += col_lines[line];
+					}
+				}
+				for (const auto &line : lines) {
+					os << table.left() << line << table.right() << "\n";
+				}
+			}
+			return os;
+		}
+
+		void
+		Table::left(const string &s)
+		{
+			m_left = s;
+		}
+
+		void
+		Table::mid(const string &s)
+		{
+			m_mid = s;
+		}
+
+		void
+		Table::right(const string &s)
+		{
+			m_right = s;
+		}
+
+		const string&
+		Table::left() const
+		{
+			return m_left;
+		}
+
+		const string&
+		Table::mid() const
+		{
+			return m_mid;
+		}
+
+		const string&
+		Table::right() const
+		{
+			return m_right;
+		}
+	}
+}
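The table renderer above works in two passes: it first measures every cell's text to find column widths and row heights, then asks each cell to render itself padded to those dimensions. The measuring and padding steps can be sketched on their own as below. `split_lines` is a hypothetical replacement for crucible's `split()`, and its handling of an empty string (one empty line) is an assumption that may differ from the real function; `pad_left` is likewise an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical replacement for crucible's split(): break s at each '\n'.
// Assumption: an empty string yields a single empty line.
inline std::vector<std::string>
split_lines(const std::string &s)
{
	std::vector<std::string> rv;
	std::size_t begin = 0;
	while (true) {
		const auto end = s.find('\n', begin);
		if (end == std::string::npos) {
			rv.push_back(s.substr(begin));
			return rv;
		}
		rv.push_back(s.substr(begin, end - begin));
		begin = end + 1;
	}
}

// Width of the longest line and number of lines, as (width, height)
inline std::pair<std::size_t, std::size_t>
text_size(const std::string &s)
{
	const auto lines = split_lines(s);
	assert(!lines.empty());
	std::size_t width = 0;
	for (const auto &line : lines) {
		width = std::max(width, line.length());
	}
	return std::make_pair(width, lines.size());
}

// Right-align one line in a field of the given width, as Text() does;
// lines longer than the field are returned unchanged.
inline std::string
pad_left(const std::string &line, const std::size_t width)
{
	if (line.length() < width) {
		return std::string(width - line.length(), ' ') + line;
	}
	return line;
}
```

Rendering a cell with width 0 and height 0, as the first pass does, therefore returns the unpadded text, which is what makes the measurement possible.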
@@ -76,13 +76,24 @@ namespace crucible {
 		/// Tasks to be executed after the current task is executed
 		list<TaskStatePtr>			m_post_exec_queue;

-		/// Set by run() and append().  Cleared by exec().
+		/// Set by run(), append(), and insert().  Cleared by exec().
 		bool					m_run_now = false;

+		/// Set by insert().  Cleared by exec() and destructor.
+		bool					m_sort_queue = false;
+
 		/// Set when task starts execution by exec().
 		/// Cleared when exec() ends.
 		bool					m_is_running = false;

+		/// Set when task is queued while already running.
+		/// Cleared when task is requeued.
+		bool					m_run_again = false;
+
+		/// Set when task is queued as idle task while already running.
+		/// Cleared when task is queued as non-idle task.
+		bool					m_idle = false;
+
 		/// Sequential identifier for next task
 		static atomic<TaskId>			s_next_id;

@@ -107,7 +118,7 @@ namespace crucible {
 		static void clear_queue(TaskQueue &tq);

 		/// Rescue any TaskQueue, not just this one.
-		static void rescue_queue(TaskQueue &tq);
+		static void rescue_queue(TaskQueue &tq, const bool sort_queue);

 		TaskState &operator=(const TaskState &) = delete;
 		TaskState(const TaskState &) = delete;
@@ -124,6 +135,9 @@ namespace crucible {
 		/// instance at the end of TaskMaster's global queue.
 		void run();

+		/// Run the task when there are no more Tasks on the main queue.
+		void idle();
+
 		/// Execute task immediately in current thread if it is not already
 		/// executing in another thread; otherwise, append the current task
 		/// to itself to be executed immediately in the other thread.
@@ -139,6 +153,10 @@ namespace crucible {
 		/// or is destroyed.
 		void append(const TaskStatePtr &task);

+		/// Queue task to execute after current task finishes executing
+		/// or is destroyed, in task ID order.
+		void insert(const TaskStatePtr &task);
+
 		/// How many Tasks are there?  Good for catching leaks
 		static size_t instance_count();
 	};
@@ -150,6 +168,7 @@ namespace crucible {
 		mutex 					m_mutex;
 		condition_variable 			m_condvar;
 		TaskQueue				m_queue;
+		TaskQueue				m_idle_queue;
 		size_t					m_thread_max;
 		size_t					m_thread_min = 0;
 		set<TaskConsumerPtr>			m_threads;
@@ -184,6 +203,7 @@ namespace crucible {
 		TaskMasterState(size_t thread_max = thread::hardware_concurrency());

 		static void push_back(const TaskStatePtr &task);
+		static void push_back_idle(const TaskStatePtr &task);
 		static void push_front(TaskQueue &queue);
 		size_t get_queue_count();
 		size_t get_thread_count();
@@ -214,16 +234,21 @@ namespace crucible {
 	static auto s_tms = make_shared<TaskMasterState>();

 	void
-	TaskState::rescue_queue(TaskQueue &queue)
+	TaskState::rescue_queue(TaskQueue &queue, const bool sort_queue)
 	{
 		if (queue.empty()) {
 			return;
 		}
-		const auto tlcc = tl_current_consumer;
+		const auto &tlcc = tl_current_consumer;
 		if (tlcc) {
 			// We are executing under a TaskConsumer, splice our post-exec queue at front.
 			// No locks needed because we are using only thread-local objects.
 			tlcc->m_local_queue.splice(tlcc->m_local_queue.begin(), queue);
+			if (sort_queue) {
+				tlcc->m_local_queue.sort([&](const TaskStatePtr &a, const TaskStatePtr &b) {
+					return a->m_id < b->m_id;
+				});
+			}
 		} else {
 			// We are not executing under a TaskConsumer.
 			// If there is only one task, then just insert it at the front of the queue.
@@ -234,6 +259,8 @@ namespace crucible {
 				// then push it to the front of the global queue using normal locking methods.
 				TaskStatePtr rescue_task(make_shared<TaskState>("rescue_task", [](){}));
 				swap(rescue_task->m_post_exec_queue, queue);
+				// Do the sort--once--when a new Consumer has picked up the Task
+				rescue_task->m_sort_queue = sort_queue;
 				TaskQueue tq_one { rescue_task };
 				TaskMasterState::push_front(tq_one);
 			}
@@ -246,7 +273,8 @@ namespace crucible {
 		--s_instance_count;
 		unique_lock<mutex> lock(m_mutex);
 		// If any dependent Tasks were appended since the last exec, run them now
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
+		// No need to clear m_sort_queue here, it won't exist soon
 	}

 	TaskState::TaskState(string title, function<void()> exec_fn) :
@@ -305,6 +333,24 @@ namespace crucible {
 			task->m_run_now = true;
 			append_nolock(task);
 		}
+		task->m_idle = false;
+	}
+
+	void
+	TaskState::insert(const TaskStatePtr &task)
+	{
+		THROW_CHECK0(invalid_argument, task);
+		THROW_CHECK2(invalid_argument, m_id, task->m_id, m_id != task->m_id);
+		PairLock lock(m_mutex, task->m_mutex);
+		if (!task->m_run_now) {
+			task->m_run_now = true;
+			// Move the task and its post-exec queue to follow this task,
+			// and request a sort of the flattened list.
+			m_sort_queue = true;
+			m_post_exec_queue.push_back(task);
+			m_post_exec_queue.splice(m_post_exec_queue.end(), task->m_post_exec_queue);
+		}
+		task->m_idle = false;
 	}

 	void
@@ -315,7 +361,7 @@ namespace crucible {

 		unique_lock<mutex> lock(m_mutex);
 		if (m_is_running) {
-			append_nolock(shared_from_this());
+			m_run_again = true;
 			return;
 		} else {
 			m_run_now = false;
@@ -339,8 +385,20 @@ namespace crucible {
 		swap(this_task, tl_current_task);
 		m_is_running = false;

+		if (m_run_again) {
+			m_run_again = false;
+			if (m_idle) {
+				// All the way back to the end of the line
+				TaskMasterState::push_back_idle(shared_from_this());
+			} else {
+				// Insert after any dependents waiting for this Task
+				m_post_exec_queue.push_back(shared_from_this());
+			}
+		}
+
 		// Splice task post_exec queue at front of local queue
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
+		m_sort_queue = false;
 	}

 	string
@@ -360,11 +418,32 @@ namespace crucible {
 	TaskState::run()
 	{
 		unique_lock<mutex> lock(m_mutex);
+		m_idle = false;
 		if (m_run_now) {
 			return;
 		}
 		m_run_now = true;
-		TaskMasterState::push_back(shared_from_this());
+		if (m_is_running) {
+			m_run_again = true;
+		} else {
+			TaskMasterState::push_back(shared_from_this());
+		}
+	}
+
+	void
+	TaskState::idle()
+	{
+		unique_lock<mutex> lock(m_mutex);
+		m_idle = true;
+		if (m_run_now) {
+			return;
+		}
+		m_run_now = true;
+		if (m_is_running) {
+			m_run_again = true;
+		} else {
+			TaskMasterState::push_back_idle(shared_from_this());
+		}
 	}

 	TaskMasterState::TaskMasterState(size_t thread_max) :
@@ -410,6 +489,20 @@ namespace crucible {
 		s_tms->start_threads_nolock();
 	}

+	void
+	TaskMasterState::push_back_idle(const TaskStatePtr &task)
+	{
+		THROW_CHECK0(runtime_error, task);
+		unique_lock<mutex> lock(s_tms->m_mutex);
+		if (s_tms->m_cancelled) {
+			task->clear();
+			return;
+		}
+		s_tms->m_idle_queue.push_back(task);
+		s_tms->m_condvar.notify_all();
+		s_tms->start_threads_nolock();
+	}
+
 	void
 	TaskMasterState::push_front(TaskQueue &queue)
 	{
@@ -456,12 +549,26 @@ namespace crucible {
 	TaskMaster::print_queue(ostream &os)
 	{
 		unique_lock<mutex> lock(s_tms->m_mutex);
-		os << "Queue (size " << s_tms->m_queue.size() << "):" << endl;
+		auto queue_copy = s_tms->m_queue;
+		lock.unlock();
+		os << "Queue (size " << queue_copy.size() << "):" << endl;
 		size_t counter = 0;
-		for (auto i : s_tms->m_queue) {
+		for (auto i : queue_copy) {
 			os << "Queue #" << ++counter << " Task ID " << i->id() << " " << i->title() << endl;
 		}
-		return os << "Queue End" << endl;
+		os << "Queue End" << endl;
+
+		lock.lock();
+		queue_copy = s_tms->m_idle_queue;
+		lock.unlock();
+		os << "Idle (size " << queue_copy.size() << "):" << endl;
+		counter = 0;
+		for (const auto &i : queue_copy) {
+			os << "Idle #" << ++counter << " Task ID " << i->id() << " " << i->title() << endl;
+		}
+		os << "Idle End" << endl;
+
+		return os;
 	}

 	ostream &
@@ -486,11 +593,6 @@ namespace crucible {
 	size_t
 	TaskMasterState::calculate_thread_count_nolock()
 	{
-		if (m_paused) {
-			// No threads running while paused or cancelled
-			return 0;
-		}
-
 		if (m_load_target == 0) {
 			// No limits, no stats, use configured thread count
 			return m_configured_thread_max;
@@ -583,6 +685,7 @@ namespace crucible {
 		m_cancelled = true;
 		decltype(m_queue) empty_queue;
 		m_queue.swap(empty_queue);
+		empty_queue.splice(empty_queue.end(), m_idle_queue);
 		m_condvar.notify_all();
 		lock.unlock();
 		TaskState::clear_queue(empty_queue);
@@ -600,6 +703,9 @@ namespace crucible {
 		unique_lock<mutex> lock(m_mutex);
 		m_paused = paused;
 		m_condvar.notify_all();
+		if (!m_paused) {
+			start_threads_nolock();
+		}
 		lock.unlock();
 	}

@@ -682,6 +788,13 @@ namespace crucible {
 		m_task_state->run();
 	}

+	void
+	Task::idle() const
+	{
+		THROW_CHECK0(runtime_error, m_task_state);
+		m_task_state->idle();
+	}
+
 	void
 	Task::append(const Task &that) const
 	{
@@ -690,6 +803,14 @@ namespace crucible {
 		m_task_state->append(that.m_task_state);
 	}

+	void
+	Task::insert(const Task &that) const
+	{
+		THROW_CHECK0(runtime_error, m_task_state);
+		THROW_CHECK0(runtime_error, that);
+		m_task_state->insert(that.m_task_state);
+	}
+
 	Task
 	Task::current_task()
 	{
@@ -772,6 +893,9 @@ namespace crucible {
 			} else if (!master_copy->m_queue.empty()) {
 				m_current_task = *master_copy->m_queue.begin();
 				master_copy->m_queue.pop_front();
+			} else if (!master_copy->m_idle_queue.empty()) {
+				m_current_task = *master_copy->m_idle_queue.begin();
+				master_copy->m_idle_queue.pop_front();
 			} else {
 				master_copy->m_condvar.wait(lock);
 				continue;
@@ -801,11 +925,13 @@ namespace crucible {
 		swap(this_consumer, tl_current_consumer);
 		assert(!tl_current_consumer);

-		// Release lock to rescue queue (may attempt to queue a new task at TaskMaster).
-		// rescue_queue normally sends tasks to the local queue of the current TaskConsumer thread,
-		// but we just disconnected ourselves from that.
+		// Release lock to rescue queue (may attempt to queue a
+		// new task at TaskMaster).  rescue_queue normally sends
+		// tasks to the local queue of the current TaskConsumer
+		// thread, but we just disconnected ourselves from that.
+		// No sorting here because this is not a TaskState.
 		lock.unlock();
-		TaskState::rescue_queue(m_local_queue);
+		TaskState::rescue_queue(m_local_queue, false);

 		// Hold lock so we can erase ourselves
 		lock.lock();
@@ -883,21 +1009,6 @@ namespace crucible {
 		m_owner.reset();
 	}

-	void
-	Exclusion::insert_task(const Task &task)
-	{
-		unique_lock<mutex> lock(m_mutex);
-		const auto sp = m_owner.lock();
-		lock.unlock();
-		if (sp) {
-			// If Exclusion is locked then queue task for release;
-			sp->append(task);
-		} else {
-			// otherwise, run the inserted task immediately
-			task.run();
-		}
-	}
-
 	ExclusionLock
 	Exclusion::try_lock(const Task &task)
 	{
@@ -905,7 +1016,7 @@ namespace crucible {
 		const auto sp = m_owner.lock();
 		if (sp) {
 			if (task) {
-				sp->append(task);
+				sp->insert(task);
 			}
 			return ExclusionLock();
 		} else {
@@ -98,12 +98,16 @@ namespace crucible {
 		m_rate(rate),
 		m_burst(burst)
 	{
+		THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
+		THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
 	}

 	RateLimiter::RateLimiter(double rate) :
 		m_rate(rate),
 		m_burst(rate)
 	{
+		THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
+		THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
 	}

 	void
@@ -119,6 +123,7 @@ namespace crucible {
 	double
 	RateLimiter::sleep_time(double cost)
 	{
+		THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
 		borrow(cost);
 		unique_lock<mutex> lock(m_mutex);
 		update_tokens();
@@ -154,6 +159,21 @@ namespace crucible {
 		m_tokens -= cost;
 	}

+	void
+	RateLimiter::rate(double const new_rate)
+	{
+		THROW_CHECK1(invalid_argument, new_rate, new_rate > 0);
+		unique_lock<mutex> lock(m_mutex);
+		m_rate = new_rate;
+	}
+
+	double
+	RateLimiter::rate() const
+	{
+		unique_lock<mutex> lock(m_mutex);
+		return m_rate;
+	}
+
 	RateEstimator::RateEstimator(double min_delay, double max_delay) :
 		m_min_delay(min_delay),
 		m_max_delay(max_delay)
@@ -202,6 +222,13 @@ namespace crucible {
 		}
 	}

+	void
+	RateEstimator::increment(const uint64_t more)
+	{
+		unique_lock<mutex> lock(m_mutex);
+		return update_unlocked(m_last_count + more);
+	}
+
 	uint64_t
 	RateEstimator::count() const
 	{
@@ -1,5 +1,13 @@
 #!/bin/bash

+# if not called from systemd try to replicate mount unsharing on ctrl+c
+# see: https://github.com/Zygo/bees/issues/281
+if [ -z "${SYSTEMD_EXEC_PID}" -a -z "${UNSHARE_DONE}" ]; then
+        UNSHARE_DONE=true
+        export UNSHARE_DONE
+        exec unshare -m --propagation private -- "$0" "$@"
+fi
+
 ## Helpful functions
 INFO(){ echo "INFO:" "$@"; }
 ERRO(){ echo "ERROR:" "$@"; exit 1; }
@@ -108,13 +116,11 @@ mkdir -p "$WORK_DIR" || exit 1
 INFO "MOUNT DIR: $MNT_DIR"
 mkdir -p "$MNT_DIR" || exit 1

-mount --make-private -osubvolid=5 /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
+mount --make-private -osubvolid=5,nodev,noexec /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1

 if [ ! -d "$BEESHOME" ]; then
    INFO "Creating subvol $BEESHOME to store bees data"
    btrfs sub cre "$BEESHOME"
-else
-    btrfs sub show "$BEESHOME" &> /dev/null || ERRO "$BEESHOME MUST BE A SUBVOL!"
 fi

 # Check DB size
@@ -17,6 +17,7 @@ KillSignal=SIGTERM
 MemoryAccounting=true
 Nice=19
 Restart=on-abnormal
+RuntimeDirectoryMode=0700
 RuntimeDirectory=bees
 StartupCPUWeight=25
 StartupIOWeight=25
@@ -20,7 +20,6 @@
 using namespace crucible;
 using namespace std;

-
 BeesFdCache::BeesFdCache(shared_ptr<BeesContext> ctx) :
 	m_ctx(ctx)
 {
@@ -98,6 +97,9 @@ BeesContext::dump_status()
 		TaskMaster::print_queue(ofs);
 #endif

+		ofs << "PROGRESS:\n";
+		ofs << get_progress();
+
 		ofs.close();

 		BEESNOTE("renaming status file '" << status_file << "'");
@@ -112,6 +114,23 @@ BeesContext::dump_status()
 	}
 }

+void
+BeesContext::set_progress(const string &str)
+{
+	unique_lock<mutex> lock(m_progress_mtx);
+	m_progress_str = str;
+}
+
+string
+BeesContext::get_progress()
+{
+	unique_lock<mutex> lock(m_progress_mtx);
+	if (m_progress_str.empty()) {
+		return "[No progress estimate available]\n";
+	}
+	return m_progress_str;
+}
+
 void
 BeesContext::show_progress()
 {
@@ -159,6 +178,8 @@ BeesContext::show_progress()
 			BEESLOGINFO("\ttid " << t.first << ": " << t.second);
 		}

+		// No need to log progress here, it is logged when set
+
 		lastStats = thisStats;
 	}
 }
@@ -182,7 +203,7 @@ BeesContext::home_fd()
 }

 bool
-BeesContext::is_root_ro(uint64_t root)
+BeesContext::is_root_ro(uint64_t const root)
 {
 	return roots()->is_root_ro(root);
 }
@@ -192,6 +213,7 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 {
 	// TOOLONG and NOTE can retroactively fill in the filename details, but LOG can't
 	BEESNOTE("dedup " << brp_in);
+	BEESTRACE("dedup " << brp_in);

 	if (is_root_ro(brp_in.second.fid().root())) {
 		// BEESLOGDEBUG("WORKAROUND: dst root " << (brp_in.second.fid().root()) << " is read-only");
@@ -208,8 +230,10 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 	BeesAddress first_addr(brp.first.fd(), brp.first.begin());
 	BeesAddress second_addr(brp.second.fd(), brp.second.begin());

-	if (first_addr.get_physical_or_zero() == second_addr.get_physical_or_zero()) {
-		BEESLOGTRACE("equal physical addresses in dedup");
+	const auto first_gpoz = first_addr.get_physical_or_zero();
+	const auto second_gpoz = second_addr.get_physical_or_zero();
+	if (first_gpoz == second_gpoz) {
+		BEESLOGDEBUG("equal physical addresses " << first_addr << " and " << second_addr << " in dedup");
 		BEESCOUNT(bug_dedup_same_physical);
 	}

@@ -219,27 +243,40 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 	BEESCOUNT(dedup_try);

 	BEESNOTE("waiting to dedup " << brp);
-	const auto lock = MultiLocker::get_lock("dedupe");
-
-	Timer dedup_timer;
+	auto lock = MultiLocker::get_lock("dedupe");

 	BEESLOGINFO("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
 		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
 	BEESNOTE("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
 		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));

-	const bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
-	BEESCOUNTADD(dedup_ms, dedup_timer.age() * 1000);
+	while (true) {
+		try {
+			Timer dedup_timer;
+			const bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
+			BEESCOUNTADD(dedup_ms, dedup_timer.age() * 1000);

-	if (rv) {
-		BEESCOUNT(dedup_hit);
-		BEESCOUNTADD(dedup_bytes, brp.first.size());
-	} else {
-		BEESCOUNT(dedup_miss);
-		BEESLOGWARN("NO Dedup! " << brp);
+			if (rv) {
+				BEESCOUNT(dedup_hit);
+				BEESCOUNTADD(dedup_bytes, brp.first.size());
+			} else {
+				BEESCOUNT(dedup_miss);
+				BEESLOGINFO("NO Dedup! " << brp);
+			}
+
+			lock.reset();
+			bees_throttle(dedup_timer.age(), "dedup");
+			return rv;
+		} catch (const std::system_error &e) {
+			if (e.code().value() == EAGAIN) {
+				BEESNOTE("dedup waiting for btrfs send on " << brp.second);
+				BEESLOGDEBUG("dedup waiting for btrfs send on " << brp.second);
+				roots()->wait_for_transid(1);
+			} else {
+				throw;
+			}
+		}
 	}
-
-	return rv;
 }

 BeesRangePair
@@ -264,6 +301,7 @@ BeesContext::rewrite_file_range(const BeesFileRange &bfr)
 	// BEESLOG("BeesResolver br(..., " << bfr << ")");
 	BEESTRACE("BeesContext::rewrite_file_range calling BeesResolver " << bfr);
 	BeesResolver br(m_ctx, BeesAddress(bfr.fd(), bfr.begin()));
+	BEESTRACE("BeesContext::rewrite_file_range calling replace_src " << dup_bbd);
 	// BEESLOG("\treplace_src " << dup_bbd);
 	br.replace_src(dup_bbd);
 	BEESCOUNT(scan_rewrite);
@@ -291,23 +329,38 @@ BeesContext::rewrite_file_range(const BeesFileRange &bfr)
 	}
 }

-BeesFileRange
+struct BeesSeenRange {
+	uint64_t bytenr;
+	off_t offset;
+	off_t length;
+};
+
+static
+bool
+operator<(const BeesSeenRange &bsr1, const BeesSeenRange &bsr2)
+{
+	return tie(bsr1.bytenr, bsr1.offset, bsr1.length) < tie(bsr2.bytenr, bsr2.offset, bsr2.length);
+}
+
+static
+__attribute__((unused))
+ostream&
+operator<<(ostream &os, const BeesSeenRange &tup)
+{
+	return os << "BeesSeenRange { " << to_hex(tup.bytenr) << ", " << to_hex(tup.offset) << "+" << pretty(tup.length) << " }";
+}
+
+void
 BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 {
 	BEESNOTE("Scanning " << pretty(e.size()) << " "
 		<< to_hex(e.begin()) << ".." << to_hex(e.end())
 		<< " " << name_fd(bfr.fd()) );
 	BEESTRACE("scan extent " << e);
+	BEESTRACE("scan bfr " << bfr);
 	BEESCOUNT(scan_extent);

-	// EXPERIMENT:  Don't bother with tiny extents unless they are the entire file.
-	// We'll take a tiny extent at BOF or EOF but not in between.
-	if (e.begin() && e.size() < 128 * 1024 && e.end() != Stat(bfr.fd()).st_size) {
-		BEESCOUNT(scan_extent_tiny);
-		// This doesn't work properly with the current architecture,
-		// so we don't do an early return here.
-		// return bfr;
-	}
+	Timer one_timer;

 	// We keep moving this method around
 	auto m_ctx = shared_from_this();
@@ -322,19 +375,19 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 		Extent::OBSCURED | Extent::PREALLOC
 	)) {
 		BEESCOUNT(scan_interesting);
-		BEESLOGWARN("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
+		BEESLOGINFO("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
 	}

 	if (e.flags() & Extent::HOLE) {
 		// Nothing here, dispose of this early
 		BEESCOUNT(scan_hole);
-		return bfr;
+		return;
 	}

 	if (e.flags() & Extent::PREALLOC) {
 		// Prealloc is all zero and we replace it with a hole.
 		// No special handling is required here.  Nuke it and move on.
-		BEESLOGINFO("prealloc extent " << e);
+		BEESLOGINFO("prealloc extent " << e << " in " << bfr);
 		// Must not extend past EOF
 		auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
 		// Must hold tmpfile until dedupe is done
@@ -347,38 +400,57 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 		if (m_ctx->dedup(brp)) {
 			BEESCOUNT(dedup_prealloc_hit);
 			BEESCOUNTADD(dedup_prealloc_bytes, e.size());
-			return bfr;
+			return;
 		} else {
 			BEESCOUNT(dedup_prealloc_miss);
 		}
 	}

+	// If we already read this extent and inserted it into the hash table, no need to read it again
+	static mutex s_seen_mutex;
+	unique_lock<mutex> lock_seen(s_seen_mutex);
+	const BeesSeenRange tup = {
+		.bytenr = e.bytenr(),
+		.offset = e.offset(),
+		.length = e.size(),
+	};
+	static set<BeesSeenRange> s_seen;
+	if (s_seen.size() > BEES_MAX_EXTENT_REF_COUNT) {
+		s_seen.clear();
+		BEESCOUNT(scan_seen_clear);
+	}
+	const auto seen_rv = s_seen.find(tup) != s_seen.end();
+	if (!seen_rv) {
+		BEESCOUNT(scan_seen_miss);
+	} else {
+		// BEESLOGDEBUG("Skip " << tup << " " << e);
+		BEESCOUNT(scan_seen_hit);
+		return;
+	}
+	lock_seen.unlock();
+
 	// OK we need to read extent now
 	bees_readahead(bfr.fd(), bfr.begin(), bfr.size());

 	map<off_t, pair<BeesHash, BeesAddress>> insert_map;
-	set<off_t> noinsert_set;
-
-	// Hole handling
-	bool extent_compressed = e.flags() & FIEMAP_EXTENT_ENCODED;
-	bool extent_contains_zero = false;
-	bool extent_contains_nonzero = false;
-
-	// Need to replace extent
-	bool rewrite_extent = false;
+	set<off_t> dedupe_set;
+	set<off_t> zero_set;

 	// Pretty graphs
 	off_t block_count = ((e.size() + BLOCK_MASK_SUMS) & ~BLOCK_MASK_SUMS) / BLOCK_SIZE_SUMS;
 	BEESTRACE(e << " block_count " << block_count);
 	string bar(block_count, '#');

-	for (off_t next_p = e.begin(); next_p < e.end(); ) {
+	// List of dedupes found
+	list<BeesRangePair> dedupe_list;
+	list<BeesFileRange> copy_list;
+	list<pair<BeesHash, BeesAddress>> front_hash_list;
+	list<uint64_t> invalidate_addr_list;

-		// Guarantee forward progress
-		off_t p = next_p;
-		next_p += BLOCK_SIZE_SUMS;
+	off_t next_p = e.begin();
+	for (off_t p = e.begin(); p < e.end(); p += BLOCK_SIZE_SUMS) {

-		off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
+		const off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
 		BeesAddress addr(e, p);

 		// This extent should consist entirely of non-magic blocks
@@ -393,69 +465,68 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)

 		// Calculate the hash first because it lets us shortcut on is_data_zero
 		BEESNOTE("scan hash " << bbd);
-		BeesHash hash = bbd.hash();
+		const BeesHash hash = bbd.hash();
+
+		// Weed out zero blocks
+		BEESNOTE("is_data_zero " << bbd);
+		const bool data_is_zero = bbd.is_data_zero();
+		if (data_is_zero) {
+			bar.at(bar_p) = '0';
+			zero_set.insert(p);
+			BEESCOUNT(scan_zero);
+			continue;
+		}

 		// Schedule this block for insertion if we decide to keep this extent.
 		BEESCOUNT(scan_hash_preinsert);
 		BEESTRACE("Pushing hash " << hash << " addr " << addr << " bbd " << bbd);
 		insert_map.insert(make_pair(p, make_pair(hash, addr)));
-		bar.at(bar_p) = 'R';
+		bar.at(bar_p) = 'i';

-		// Weed out zero blocks
-		BEESNOTE("is_data_zero " << bbd);
-		bool extent_is_zero = bbd.is_data_zero();
-		if (extent_is_zero) {
-			bar.at(bar_p) = '0';
-			if (extent_compressed) {
-				if (!extent_contains_zero) {
-					// BEESLOG("compressed zero bbd " << bbd << "\n\tin extent " << e);
-				}
-				extent_contains_zero = true;
-				// Do not attempt to lookup hash of zero block
-				continue;
-			} else {
-				BEESLOGINFO("zero bbd " << bbd << "\n\tin extent " << e);
-				BEESCOUNT(scan_zero_uncompressed);
-				rewrite_extent = true;
-				break;
-			}
-		} else {
-			if (extent_contains_zero && !extent_contains_nonzero) {
-				// BEESLOG("compressed nonzero bbd " << bbd << "\n\tin extent " << e);
-			}
-			extent_contains_nonzero = true;
-		}
+		// Ensure we fill in the entire insert_map without skipping any non-zero blocks
+		if (p < next_p) continue;

 		BEESNOTE("lookup hash " << bbd);
-		auto found = hash_table->find_cell(hash);
+		const auto found = hash_table->find_cell(hash);
 		BEESCOUNT(scan_lookup);

-		set<BeesResolver> resolved_addrs;
 		set<BeesAddress> found_addrs;
+		list<BeesAddress> ordered_addrs;

-		// We know that there is at least one copy of the data and where it is,
-		// but we don't want to do expensive LOGICAL_INO operations unless there
-		// are at least two distinct addresses to look at.
-		found_addrs.insert(addr);
-
-		for (auto i : found) {
+		for (const auto &i : found) {
 			BEESTRACE("found (hash, address): " << i);
 			BEESCOUNT(scan_found);

 			// Hash has to match
 			THROW_CHECK2(runtime_error, i.e_hash, hash, i.e_hash == hash);

+			// We know that there is at least one copy of the data and where it is.
+			// Filter out anything that can't possibly match before we pull out the
+			// LOGICAL_INO hammer.
 			BeesAddress found_addr(i.e_addr);

-#if 0
 			// If address already in hash table, move on to next extent.
-			// We've already seen this block and may have made additional references to it.
-			// The current extent is effectively "pinned" and can't be modified any more.
+			// Only extents that are scanned but not modified are inserted, so if there's
+			// a matching hash:address pair in the hash table:
+			// 1.  We have already scanned this extent.
+			// 2.  We may have already created references to this extent.
+			// 3.  We won't scan this extent again.
+			// The current extent is effectively "pinned" and can't be modified
+			// without rescanning all the existing references.
 			if (found_addr.get_physical_or_zero() == addr.get_physical_or_zero()) {
+				// No log message because this happens to many thousands of blocks
+				// when bees is interrupted.
+				// BEESLOGDEBUG("Found matching hash " << hash << " at same address " << addr << ", skipping " << bfr);
 				BEESCOUNT(scan_already);
-				return bfr;
+				return;
+			}
+
+			// Address is a duplicate.
+			// Check this early so we don't have duplicate counts.
+			if (!found_addrs.insert(found_addr).second) {
+				BEESCOUNT(scan_twice);
+				continue;
 			}
-#endif

 			// Block must have matching EOF alignment
 			if (found_addr.is_unaligned_eof() != addr.is_unaligned_eof()) {
@@ -463,214 +534,353 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 				continue;
 			}

-			// Address is a duplicate
-			if (!found_addrs.insert(found_addr).second) {
-				BEESCOUNT(scan_twice);
-				continue;
-			}
-
 			// Hash is toxic
 			if (found_addr.is_toxic()) {
-				BEESLOGWARN("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
+				BEESLOGDEBUG("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
 				// Don't push these back in because we'll never delete them.
 				// Extents may become non-toxic so give them a chance to expire.
 				// hash_table->push_front_hash_addr(hash, found_addr);
 				BEESCOUNT(scan_toxic_hash);
-				return bfr;
+				return;
 			}

-			// Distinct address, go resolve it
-			bool abandon_extent = false;
-			catch_all([&]() {
-				BEESNOTE("resolving " << found_addr << " matched " << bbd);
-				BEESTRACE("resolving " << found_addr << " matched " << bbd);
-				BEESTRACE("BeesContext::scan_one_extent calling BeesResolver " << found_addr);
-				BeesResolver resolved(m_ctx, found_addr);
-				// Toxic extents are really toxic
-				if (resolved.is_toxic()) {
-					BEESLOGWARN("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
-					BEESCOUNT(scan_toxic_match);
-					// Make sure we never see this hash again.
-					// It has become toxic since it was inserted into the hash table.
-					found_addr.set_toxic();
-					hash_table->push_front_hash_addr(hash, found_addr);
-					abandon_extent = true;
-				} else if (!resolved.count()) {
-					BEESCOUNT(scan_resolve_zero);
-					// Didn't find anything, address is dead
-					BEESTRACE("matched hash " << hash << " addr " << addr << " count zero");
-					hash_table->erase_hash_addr(hash, found_addr);
-				} else {
-					resolved_addrs.insert(resolved);
-					BEESCOUNT(scan_resolve_hit);
-				}
-			});
+			// Put this address in the list without changing hash table order
+			ordered_addrs.push_back(found_addr);
+		}

-			if (abandon_extent) {
-				return bfr;
+		// Cheap filtering is now out of the way, now for some heavy lifting
+		for (auto found_addr : ordered_addrs) {
+			// Hash table says there's a matching block on the filesystem.
+			// Go find refs to it.
+			BEESNOTE("resolving " << found_addr << " matched " << bbd);
+			BEESTRACE("resolving " << found_addr << " matched " << bbd);
+			BEESTRACE("BeesContext::scan_one_extent calling BeesResolver " << found_addr);
+			BeesResolver resolved(m_ctx, found_addr);
+			// Toxic extents are really toxic
+			if (resolved.is_toxic()) {
+				BEESLOGDEBUG("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
+				BEESCOUNT(scan_toxic_match);
+				// Make sure we never see this hash again.
+				// It has become toxic since it was inserted into the hash table.
+				found_addr.set_toxic();
+				hash_table->push_front_hash_addr(hash, found_addr);
+				return;
+			} else if (!resolved.count()) {
+				BEESCOUNT(scan_resolve_zero);
+				// Didn't find a block at the table address, address is dead
+				BEESLOGDEBUG("Erasing stale addr " << found_addr << " hash " << hash);
+				hash_table->erase_hash_addr(hash, found_addr);
+				continue;
+			} else {
+				BEESCOUNT(scan_resolve_hit);
 			}
-		}

-		// This shouldn't happen (often), so let's count it separately
-		if (resolved_addrs.size() > 2) {
-			BEESCOUNT(matched_3_or_more);
-		}
-		if (resolved_addrs.size() > 1) {
-			BEESCOUNT(matched_2_or_more);
-		}
-
-		// No need to do all this unless there are two or more distinct matches
-		if (!resolved_addrs.empty()) {
+			// `resolved` contains references to a block on the filesystem that still exists.
 			bar.at(bar_p) = 'M';
-			BEESCOUNT(matched_1_or_more);
-			BEESTRACE("resolved_addrs.size() = " << resolved_addrs.size());
-			BEESNOTE("resolving " << resolved_addrs.size() << " matches for hash " << hash);

-			BeesFileRange replaced_bfr;
+			BEESNOTE("finding one match (out of " << resolved.count() << ") at " << resolved.addr() << " for " << bbd);
+			BEESTRACE("finding one match (out of " << resolved.count() << ") at " << resolved.addr() << " for " << bbd);
+			auto replaced_brp = resolved.replace_dst(bbd);
+			BeesFileRange &replaced_bfr = replaced_brp.second;
+			BEESTRACE("next_p " << to_hex(next_p) << " -> replaced_bfr " << replaced_bfr);

-			BeesAddress last_replaced_addr;
-			for (auto it = resolved_addrs.begin(); it != resolved_addrs.end(); ++it) {
-				// FIXME:  Need to terminate this loop on replace_dst exception condition
-				// catch_all([&]() {
-					auto it_copy = *it;
-					BEESNOTE("finding one match (out of " << it_copy.count() << ") at " << it_copy.addr() << " for " << bbd);
-					BEESTRACE("finding one match (out of " << it_copy.count() << ") at " << it_copy.addr() << " for " << bbd);
-					replaced_bfr = it_copy.replace_dst(bbd);
-					BEESTRACE("next_p " << to_hex(next_p) << " -> replaced_bfr " << replaced_bfr);
-
-					// If we didn't find this hash where the hash table said it would be,
-					// correct the hash table.
-					if (it_copy.found_hash()) {
-						BEESCOUNT(scan_hash_hit);
-					} else {
-						// BEESLOGDEBUG("erase src hash " << hash << " addr " << it_copy.addr());
-						BEESCOUNT(scan_hash_miss);
-						hash_table->erase_hash_addr(hash, it_copy.addr());
-					}
-
-					if (it_copy.found_dup()) {
-						BEESCOUNT(scan_dup_hit);
-
-						// FIXME:  we will thrash if we let multiple references to identical blocks
-						// exist in the hash table.  Erase all but the last one.
-						if (last_replaced_addr) {
-							BEESLOGINFO("Erasing redundant hash " << hash << " addr " << last_replaced_addr);
-							hash_table->erase_hash_addr(hash, last_replaced_addr);
-							BEESCOUNT(scan_erase_redundant);
-						}
-						last_replaced_addr = it_copy.addr();
-
-						// Invalidate resolve cache so we can count refs correctly
-						m_ctx->invalidate_addr(it_copy.addr());
-						m_ctx->invalidate_addr(bbd.addr());
-
-						// Remove deduped blocks from insert map
-						THROW_CHECK0(runtime_error, replaced_bfr);
-						for (off_t ip = replaced_bfr.begin(); ip < replaced_bfr.end(); ip += BLOCK_SIZE_SUMS) {
-							BEESCOUNT(scan_dup_block);
-							noinsert_set.insert(ip);
-							if (ip >= e.begin() && ip < e.end()) {
-								off_t bar_p = (ip - e.begin()) / BLOCK_SIZE_SUMS;
-								bar.at(bar_p) = 'd';
-							}
-						}
-
-						// next_p may be past EOF so check p only
-						THROW_CHECK2(runtime_error, p, replaced_bfr, p < replaced_bfr.end());
-
-						BEESCOUNT(scan_bump);
-						next_p = replaced_bfr.end();
-					} else {
-						BEESCOUNT(scan_dup_miss);
-					}
-				// });
+			// If we did find a block, but not this hash, correct the hash table and move on
+			if (resolved.found_hash()) {
+				BEESCOUNT(scan_hash_hit);
+			} else {
+				BEESLOGDEBUG("Erasing stale hash " << hash << " addr " << resolved.addr());
+				hash_table->erase_hash_addr(hash, resolved.addr());
+				BEESCOUNT(scan_hash_miss);
+				continue;
 			}
-			if (last_replaced_addr) {
-				// If we replaced extents containing the incoming addr,
-				// push the addr we kept to the front of the hash LRU.
-				hash_table->push_front_hash_addr(hash, last_replaced_addr);
-				BEESCOUNT(scan_push_front);
+
+			// We found a block and it was a duplicate
+			if (resolved.found_dup()) {
+				THROW_CHECK0(runtime_error, replaced_bfr);
+				BEESCOUNT(scan_dup_hit);
+
+				// Save this match.  If a better match is found later,
+				// it will be replaced.
+				dedupe_list.push_back(replaced_brp);
+
+				// Push matching block to front of LRU
+				front_hash_list.push_back(make_pair(hash, resolved.addr()));
+
+				// This is the block that matched in the replaced bfr
+				bar.at(bar_p) = '=';
+
+				// Invalidate resolve cache so we can count refs correctly
+				invalidate_addr_list.push_back(resolved.addr());
+				invalidate_addr_list.push_back(bbd.addr());
+
+				// next_p may be past EOF so check p only
+				THROW_CHECK2(runtime_error, p, replaced_bfr, p < replaced_bfr.end());
+
+				// We may find duplicate ranges of various lengths, so make sure
+				// we don't pick a smaller one
+				next_p = max(next_p, replaced_bfr.end());
+
+				// Stop after one dedupe is found.  If there's a longer matching range
+				// out there, we'll find a matching block after the end of this range,
+				// since the longer range is longer than this one.
+				break;
+			} else {
+				BEESCOUNT(scan_dup_miss);
 			}
-		} else {
-			BEESCOUNT(matched_0);
 		}
 	}

-	// If the extent was compressed and all zeros, nuke entire thing
-	if (!rewrite_extent && (extent_contains_zero && !extent_contains_nonzero)) {
-		rewrite_extent = true;
-		BEESCOUNT(scan_zero_compressed);
+	bool force_insert = false;
+
+	// We don't want to punch holes into compressed extents, unless:
+	// 1.  There was dedupe of non-zero blocks, so we always have to copy the rest of the extent
+	// 2.  The entire extent is zero and the whole thing can be replaced with a single hole
+	const bool extent_compressed = e.flags() & FIEMAP_EXTENT_ENCODED;
+	if (extent_compressed && dedupe_list.empty() && !insert_map.empty()) {
+		// BEESLOGDEBUG("Compressed extent with non-zero data and no dedupe, skipping");
+		BEESCOUNT(scan_compressed_no_dedup);
+		force_insert = true;
 	}

-	// If we deduped any blocks then we must rewrite the remainder of the extent
-	if (!noinsert_set.empty()) {
-		rewrite_extent = true;
+	// FIXME:  dedupe_list contains a lot of overlapping matches.  Get rid of all but one.
+	list<BeesRangePair> dedupe_list_out;
+	dedupe_list.sort([](const BeesRangePair &a, const BeesRangePair &b) {
+		return b.second.size() < a.second.size();
+	});
+	// Shorten each dedupe brp by removing any overlap with earlier (longer) extents in list
+	for (auto i : dedupe_list) {
+		bool insert_i = true;
+		BEESTRACE("i = " << i << " insert_i " << insert_i);
+		for (const auto &j : dedupe_list_out) {
+			BEESTRACE("j = " << j);
+			// No overlap, try next one
+			if (j.second.end() <= i.second.begin() || j.second.begin() >= i.second.end()) {
+				continue;
+			}
+			// j fully overlaps or is the same as i, drop i
+			if (j.second.begin() <= i.second.begin() && j.second.end() >= i.second.end()) {
+				insert_i = false;
+				break;
+			}
+			// i begins outside j, i ends inside j, remove the end of i
+			if (i.second.end() > j.second.begin() && i.second.begin() <= j.second.begin()) {
+				const auto delta = i.second.end() - j.second.begin();
+				if (delta == i.second.size()) {
+					insert_i = false;
+					break;
+				}
+				i.shrink_end(delta);
+				continue;
+			}
+			// i begins inside j, ends outside j, remove the begin of i
+			if (i.second.begin() < j.second.end() && i.second.end() >= j.second.end()) {
+				const auto delta = j.second.end() - i.second.begin();
+				if (delta == i.second.size()) {
+					insert_i = false;
+					break;
+				}
+				i.shrink_begin(delta);
+				continue;
+			}
+			// i fully overlaps j, split i into two parts, push the other part onto dedupe_list
+			if (j.second.begin() > i.second.begin() && j.second.end() < i.second.end()) {
+				auto other_i = i;
+				const auto end_left_delta = i.second.end() - j.second.begin();
+				const auto begin_right_delta = j.second.end() - i.second.begin();
+				i.shrink_end(end_left_delta);
+				other_i.shrink_begin(begin_right_delta);
+				dedupe_list.push_back(other_i);
+				continue;
+			}
+			// None of the above.  Oops!
+			THROW_CHECK0(runtime_error, false);
+		}
+		if (insert_i) {
+			dedupe_list_out.push_back(i);
+		}
+	}
+	dedupe_list = dedupe_list_out;
+	dedupe_list_out.clear();
+
+	// Count total dedupes
+	uint64_t bytes_deduped = 0;
+	for (const auto &i : dedupe_list) {
+		// Remove deduped blocks from insert map and zero map
+		for (off_t ip = i.second.begin(); ip < i.second.end(); ip += BLOCK_SIZE_SUMS) {
+			BEESCOUNT(scan_dup_block);
+			dedupe_set.insert(ip);
+			zero_set.erase(ip);
+		}
+		bytes_deduped += i.second.size();
 	}

-	// If we need to replace part of the extent, rewrite all instances of it
-	if (rewrite_extent) {
-		bool blocks_rewritten = false;
+	// Copy all blocks of the extent that were not deduped or zero, but don't copy an entire extent
+	uint64_t bytes_zeroed = 0;
+	if (!force_insert) {
 		BEESTRACE("Rewriting extent " << e);
 		off_t last_p = e.begin();
 		off_t p = last_p;
-		off_t next_p;
+		off_t next_p = last_p;
 		BEESTRACE("next_p " << to_hex(next_p) << " p " << to_hex(p) << " last_p " << to_hex(last_p));
 		for (next_p = e.begin(); next_p < e.end(); ) {
 			p = next_p;
-			next_p += BLOCK_SIZE_SUMS;
+			next_p = min(next_p + BLOCK_SIZE_SUMS, e.end());

-			// BEESLOG("noinsert_set.count(" << to_hex(p) << ") " << noinsert_set.count(p));
-			if (noinsert_set.count(p)) {
+			// Can't be both dedupe and zero
+			THROW_CHECK2(runtime_error, zero_set.count(p), dedupe_set.count(p), zero_set.count(p) + dedupe_set.count(p) < 2);
+			if (zero_set.count(p)) {
+				bytes_zeroed += next_p - p;
+			}
+			// BEESLOG("dedupe_set.count(" << to_hex(p) << ") " << dedupe_set.count(p));
+			if (dedupe_set.count(p)) {
 				if (p - last_p > 0) {
-					rewrite_file_range(BeesFileRange(bfr.fd(), last_p, p));
-					blocks_rewritten = true;
+					THROW_CHECK2(runtime_error, p, e.end(), p <= e.end());
+					copy_list.push_back(BeesFileRange(bfr.fd(), last_p, p));
 				}
 				last_p = next_p;
-			} else {
-				off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
-				bar.at(bar_p) = '+';
 			}
 		}
 		BEESTRACE("last");
-		if (next_p - last_p > 0) {
-			rewrite_file_range(BeesFileRange(bfr.fd(), last_p, next_p));
-			blocks_rewritten = true;
-		}
-		if (blocks_rewritten) {
-			// Nothing left to insert, all blocks clobbered
-			insert_map.clear();
-		} else {
-			// BEESLOG("No blocks rewritten");
-			BEESCOUNT(scan_no_rewrite);
+		if (next_p > last_p) {
+			THROW_CHECK2(runtime_error, next_p, e.end(), next_p <= e.end());
+			copy_list.push_back(BeesFileRange(bfr.fd(), last_p, next_p));
 		}
 	}

-	// We did not rewrite the extent and it contained data, so insert it.
-	for (auto i : insert_map) {
-		off_t bar_p = (i.first - e.begin()) / BLOCK_SIZE_SUMS;
-		BEESTRACE("e " << e << "bar_p = " << bar_p << " i.first-e.begin() " << i.first - e.begin() << " i.second " << i.second.first << ", " << i.second.second);
-		if (noinsert_set.count(i.first)) {
-			// FIXME:  we removed one reference to this copy.  Avoid thrashing?
-			hash_table->erase_hash_addr(i.second.first, i.second.second);
-			// Block was clobbered, do not insert
-			// Will look like 'Ddddd' because we skip deduped blocks
-			bar.at(bar_p) = 'D';
-			BEESCOUNT(inserted_clobbered);
+	// Don't copy an entire extent
+	if (!bytes_zeroed && copy_list.size() == 1 && copy_list.begin()->size() == e.size()) {
+		copy_list.clear();
+	}
+
+	// Count total copies
+	uint64_t bytes_copied = 0;
+	for (const auto &i : copy_list) {
+		bytes_copied += i.size();
+	}
+
+	BEESTRACE("bar: " << bar);
+
+	// Don't do nuisance dedupes part 1:  free more blocks than we create
+	THROW_CHECK3(runtime_error, bytes_copied, bytes_zeroed, bytes_deduped, bytes_copied >= bytes_zeroed);
+	const auto cost_copy = bytes_copied - bytes_zeroed;
+	const auto gain_dedupe = bytes_deduped + bytes_zeroed;
+	if (cost_copy > gain_dedupe) {
+		BEESLOGDEBUG("Too many bytes copied (" << pretty(bytes_copied) << ") for bytes deduped (" << pretty(bytes_deduped) << ") and holes punched (" << pretty(bytes_zeroed) << "), skipping extent");
+		BEESCOUNT(scan_skip_bytes);
+		force_insert = true;
+	}
+
+	// Don't do nuisance dedupes part 2:  nobody needs more than 100 dedupe/copy ops in one extent
+	if (dedupe_list.size() + copy_list.size() > 100) {
+		BEESLOGDEBUG("Too many dedupe (" << dedupe_list.size() << ") and copy (" << copy_list.size() << ") operations, skipping extent");
+		BEESCOUNT(scan_skip_ops);
+		force_insert = true;
+	}
+
+	// Track whether we rewrote anything
+	bool extent_modified = false;
+
+	// If we didn't delete the dedupe list, do the dedupes now
+	for (const auto &i : dedupe_list) {
+		BEESNOTE("dedup " << i);
+		if (force_insert || m_ctx->dedup(i)) {
+			BEESCOUNT(replacedst_dedup_hit);
+			THROW_CHECK0(runtime_error, i.second);
+			for (off_t ip = i.second.begin(); ip < i.second.end(); ip += BLOCK_SIZE_SUMS) {
+				if (ip >= e.begin() && ip < e.end()) {
+					off_t bar_p = (ip - e.begin()) / BLOCK_SIZE_SUMS;
+					if (bar.at(bar_p) != '=') {
+						if (ip == i.second.begin()) {
+							bar.at(bar_p) = '<';
+						} else if (ip + BLOCK_SIZE_SUMS >= i.second.end()) {
+							bar.at(bar_p) = '>';
+						} else {
+							bar.at(bar_p) = 'd';
+						}
+					}
+				}
+			}
+			extent_modified = !force_insert;
 		} else {
+			BEESLOGINFO("dedup failed: " << i);
+			BEESCOUNT(replacedst_dedup_miss);
+			// User data changed while we were looking up the extent, or we have a bug.
+			// We can't fix this, but we can immediately stop wasting effort.
+			return;
+		}
+	}
+
+	// Then the copy/rewrites
+	for (const auto &i : copy_list) {
+		if (!force_insert) {
+			rewrite_file_range(i);
+			extent_modified = true;
+		}
+		for (auto p = i.begin(); p < i.end(); p += BLOCK_SIZE_SUMS) {
+			off_t bar_p = (p - e.begin()) / BLOCK_SIZE_SUMS;
+			// Leave zeros as-is because they aren't really copies
+			if (bar.at(bar_p) != '0') {
+				bar.at(bar_p) = '+';
+			}
+		}
+	}
+
+	if (!force_insert) {
+		// Push matched hashes to front
+		for (const auto &i : front_hash_list) {
+			hash_table->push_front_hash_addr(i.first, i.second);
+			BEESCOUNT(scan_push_front);
+		}
+		// Invalidate cached resolves
+		for (const auto &i : invalidate_addr_list) {
+			m_ctx->invalidate_addr(i);
+		}
+	}
+
+	// Don't insert hashes pointing to an extent we just deleted
+	if (!extent_modified) {
+		// We did not rewrite the extent and it contained data, so insert it.
+		// BEESLOGDEBUG("Inserting " << insert_map.size() << " hashes from " << bfr);
+		for (const auto &i : insert_map) {
 			hash_table->push_random_hash_addr(i.second.first, i.second.second);
-			bar.at(bar_p) = '.';
-			BEESCOUNT(inserted_block);
+			off_t bar_p = (i.first - e.begin()) / BLOCK_SIZE_SUMS;
+			if (bar.at(bar_p) == 'i') {
+				bar.at(bar_p) = '.';
+			}
+			BEESCOUNT(scan_hash_insert);
 		}
 	}

 	// Visualize
 	if (bar != string(block_count, '.')) {
-		BEESLOGINFO("scan: " << pretty(e.size()) << " " << to_hex(e.begin()) << " [" << bar << "] " << to_hex(e.end()) << ' ' << name_fd(bfr.fd()));
+		BEESLOGINFO(
+			(force_insert ? "skip" : "scan") << ": "
+			<< pretty(e.size()) << " "
+			<< dedupe_list.size() << "d" << copy_list.size() << "c"
+			<< ((bytes_zeroed + BLOCK_SIZE_SUMS - 1) / BLOCK_SIZE_SUMS) << "p"
+			<< (extent_compressed ? "z " : " ")
+			<< one_timer << "s {"
+			<< to_hex(e.bytenr()) << "+" << to_hex(e.offset()) << "} "
+			<< to_hex(e.begin()) << " [" << bar << "] " << to_hex(e.end())
+			<< ' ' << name_fd(bfr.fd())
+		);
 	}

-	// Costs 10% on benchmarks
+	// Put this extent into the recently seen list if we didn't rewrite it,
+	// and remove it if we did.
+	lock_seen.lock();
+	if (extent_modified) {
+		s_seen.erase(tup);
+		BEESCOUNT(scan_seen_erase);
+	} else {
+		// BEESLOGDEBUG("Seen " << tup << " " << e);
+		s_seen.insert(tup);
+		BEESCOUNT(scan_seen_insert);
+	}
+	lock_seen.unlock();
+
+	// Now causes 75% loss of performance in benchmarks
 	// bees_unreadahead(bfr.fd(), bfr.begin(), bfr.size());
-	return bfr;
 }

 shared_ptr<Exclusion>
@@ -703,14 +913,14 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
 	// No FD?  Well, that was quick.
 	if (!bfr.fd()) {
 		// BEESLOGINFO("No FD in " << root_path() << " for " << bfr);
-		BEESCOUNT(scan_no_fd);
+		BEESCOUNT(scanf_no_fd);
 		return false;
 	}

 	// Sanity check
 	if (bfr.begin() >= bfr.file_size()) {
-		BEESLOGWARN("past EOF: " << bfr);
-		BEESCOUNT(scan_eof);
+		BEESLOGDEBUG("past EOF: " << bfr);
+		BEESCOUNT(scanf_eof);
 		return false;
 	}

@@ -730,9 +940,11 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
 					// BEESLOGDEBUG("Deferring extent bytenr " << to_hex(extent_bytenr) << " from " << bfr);
 					BEESCOUNT(scanf_deferred_extent);
 					start_over = true;
+					return; // from closure
 				}
 				Timer one_extent_timer;
 				scan_one_extent(bfr, e);
+				// BEESLOGDEBUG("Scanned " << e << " " << bfr);
 				BEESCOUNTADD(scanf_extent_ms, one_extent_timer.age() * 1000);
 				BEESCOUNT(scanf_extent);
 			});
@@ -784,9 +996,10 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 	Timer resolve_timer;

 	struct rusage usage_before;
+	struct rusage usage_after;
 	{
 		BEESNOTE("waiting to resolve addr " << addr << " with LOGICAL_INO");
-		const auto lock = MultiLocker::get_lock("logical_ino");
+		auto lock = MultiLocker::get_lock("logical_ino");

 		// Get this thread's system CPU usage
 		DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_before));
@@ -800,13 +1013,13 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 		} else {
 			BEESCOUNT(resolve_fail);
 		}
-		BEESCOUNTADD(resolve_ms, resolve_timer.age() * 1000);
+		DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_after));
+		const auto resolve_timer_age = resolve_timer.age();
+		BEESCOUNTADD(resolve_ms, resolve_timer_age * 1000);
+		lock.reset();
+		bees_throttle(resolve_timer_age, "resolve_addr");
 	}

-	// Again!
-	struct rusage usage_after;
-	DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_after));
-
 	const double sys_usage_delta =
 		(usage_after.ru_stime.tv_sec + usage_after.ru_stime.tv_usec / 1000000.0) -
 		(usage_before.ru_stime.tv_sec + usage_before.ru_stime.tv_usec / 1000000.0);
@@ -925,7 +1138,8 @@ BeesContext::start()
 		return make_shared<BeesTempFile>(shared_from_this());
 	});
 	m_logical_ino_pool.generator([]() {
-		return make_shared<BtrfsIoctlLogicalInoArgs>(0);
+		const auto extent_ref_size = sizeof(uint64_t) * 3;
+		return make_shared<BtrfsIoctlLogicalInoArgs>(0, BEES_MAX_EXTENT_REF_COUNT * extent_ref_size + sizeof(btrfs_data_container));
 	});
 	m_tmpfile_pool.checkin([](const shared_ptr<BeesTempFile> &btf) {
 		catch_all([&](){
@@ -356,6 +356,8 @@ BeesHashTable::prefetch_loop()
 		auto avg_rates = thisStats / m_ctx->total_timer().age();
 		graph_blob << "\t" << avg_rates << "\n";

+		graph_blob << m_ctx->get_progress();
+
 		BEESLOGINFO(graph_blob.str());
 		catch_all([&]() {
 			m_stats_file.write(graph_blob.str());
@@ -446,10 +448,38 @@ BeesHashTable::fetch_missing_extent_by_index(uint64_t extent_index)

 		// If we are in prefetch, give the kernel a hint about the next extent
 		if (m_prefetch_running) {
-			// XXX: don't call this if bees_readahead is implemented by pread()
-			bees_readahead(m_fd, dirty_extent_offset + dirty_extent_size, dirty_extent_size);
+			// Use the kernel readahead here, because it might work for this use case
+			readahead(m_fd, dirty_extent_offset + dirty_extent_size, dirty_extent_size);
 		}
 	});
+
+	Cell *cell     = m_extent_ptr[extent_index    ].p_buckets[0].p_cells;
+	Cell *cell_end = m_extent_ptr[extent_index + 1].p_buckets[0].p_cells;
+	size_t toxic_cleared_count = 0;
+	set<BeesHashTable::Cell> seen_it(cell, cell_end);
+	while (cell < cell_end) {
+		if (cell->e_addr & BeesAddress::c_toxic_mask) {
+			++toxic_cleared_count;
+			cell->e_addr &= ~BeesAddress::c_toxic_mask;
+			// Clearing the toxic bit might mean we now have a duplicate.
+			// This could be due to a race between two
+			// inserts, one finds the extent toxic while the
+			// other does not.  That's arguably a bug elsewhere,
+			// but we should rewrite the whole extent lookup/insert
+			// loop, not spend time fixing code that will be
+			// thrown out later anyway.
+			// If there is a cell that is identical to this one
+			// except for the toxic bit, then we don't need this one.
+			if (seen_it.count(*cell)) {
+				cell->e_addr = 0;
+				cell->e_hash = 0;
+			}
+		}
+		++cell;
+	}
+	if (toxic_cleared_count) {
+		BEESLOGDEBUG("Cleared " << toxic_cleared_count << " hashes while fetching hash table extent " << extent_index);
+	}
 }

 void
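The toxic-bit cleanup in the hunk above can be sketched in isolation. This is a minimal illustration, not the bees implementation: `Cell` here is a hypothetical stand-in with the two fields the diff touches, and `TOXIC_MASK` is an assumed bit position chosen only for the example. As in the diff, the set of existing cells is captured before any bits are cleared, and a cell that becomes identical to a cell in that snapshot is zeroed.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for a hash table cell (field names follow the diff).
struct Cell {
	std::uint64_t e_hash;
	std::uint64_t e_addr;
	bool operator<(const Cell &that) const {
		return e_hash != that.e_hash ? e_hash < that.e_hash : e_addr < that.e_addr;
	}
};

// Assumed bit position, for illustration only.
constexpr std::uint64_t TOXIC_MASK = 1ULL << 0;

// Clears the toxic bit on every cell and returns how many cells had it set.
// A cell that matches a pre-existing cell once its bit is cleared is emptied.
std::size_t clear_toxic(std::vector<Cell> &cells)
{
	// Snapshot taken before any modification, as in the diff
	const std::set<Cell> seen(cells.begin(), cells.end());
	std::size_t cleared = 0;
	for (auto &cell : cells) {
		if (!(cell.e_addr & TOXIC_MASK)) continue;
		++cleared;
		cell.e_addr &= ~TOXIC_MASK;
		if (seen.count(cell)) {
			// An identical non-toxic cell already exists, so drop this one
			cell = Cell{0, 0};
		}
	}
	return cleared;
}
```

Given cells `{1, 0x1000}`, `{1, 0x1001}` and `{2, 0x2001}`, the second cell collapses into the first and is emptied, while the third simply loses its toxic bit.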
@@ -767,7 +797,7 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
 	for (auto fp = madv_flags; fp->value; ++fp) {
 		BEESTOOLONG("madvise(" << fp->name << ")");
 		if (madvise(m_byte_ptr, m_size, fp->value)) {
-			BEESLOGWARN("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
+			BEESLOGNOTICE("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
 		}
 	}

@@ -781,8 +811,19 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
 		prefetch_loop();
        });

-	// Blacklist might fail if the hash table is not stored on a btrfs
+	// Blacklist might fail if the hash table is not stored on a btrfs,
+	// or if it's on a _different_ btrfs
 	catch_all([&]() {
+		// Root is definitely a btrfs
+		BtrfsIoctlFsInfoArgs root_info;
+		root_info.do_ioctl(m_ctx->root_fd());
+		// Hash might not be a btrfs
+		BtrfsIoctlFsInfoArgs hash_info;
+		// If btrfs fs_info ioctl fails, it must be a different fs
+		if (!hash_info.do_ioctl_nothrow(m_fd)) return;
+		// If Hash is a btrfs, Root must be the same one
+		if (root_info.fsid() != hash_info.fsid()) return;
+		// Hash is on the same one, blacklist it
 		m_ctx->blacklist_insert(BeesFileId(m_fd));
 	});
 }
@@ -384,7 +384,7 @@ BeesResolver::for_each_extent_ref(BeesBlockData bbd, function<bool(const BeesFil
 	return stop_now;
 }

-BeesFileRange
+BeesRangePair
 BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
 {
 	BEESTRACE("replace_dst dst_bfr " << dst_bfr_in);
@@ -400,6 +400,7 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
 	BEESTRACE("overlap_bfr " << overlap_bfr);

 	BeesBlockData bbd(dst_bfr);
+	BeesRangePair rv = { BeesFileRange(), BeesFileRange() };

 	for_each_extent_ref(bbd, [&](const BeesFileRange &src_bfr_in) -> bool {
 		// Open src
@@ -436,21 +437,12 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
 			BEESCOUNT(replacedst_grown);
 		}

-		// Dedup
-		BEESNOTE("dedup " << brp);
-		if (m_ctx->dedup(brp)) {
-			BEESCOUNT(replacedst_dedup_hit);
-			m_found_dup = true;
-			overlap_bfr = brp.second;
-			// FIXME:  find best range first, then dedupe that
-			return true; // i.e. break
-		} else {
-			BEESCOUNT(replacedst_dedup_miss);
-			return false; // i.e. continue
-		}
+		rv = brp;
+		m_found_dup = true;
+		return true;
 	});
 	// BEESLOG("overlap_bfr after " << overlap_bfr);
-	return overlap_bfr.copy_closed();
+	return rv;
 }

 BeesFileRange
@@ -8,38 +8,32 @@ thread_local BeesTracer *BeesTracer::tl_next_tracer = nullptr;
 thread_local bool BeesTracer::tl_first = true;
 thread_local bool BeesTracer::tl_silent = false;

+bool
+exception_check()
+{
 #if __cplusplus >= 201703
-static
-bool
-exception_check()
-{
 	return uncaught_exceptions();
-}
 #else
-static
-bool
-exception_check()
-{
 	return uncaught_exception();
-}
 #endif
+}

 BeesTracer::~BeesTracer()
 {
 	if (!tl_silent && exception_check()) {
 		if (tl_first) {
-			BEESLOGNOTICE("--- BEGIN TRACE --- exception ---");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE --- exception ---");
 			tl_first = false;
 		}
 		try {
 			m_func();
 		} catch (exception &e) {
-			BEESLOGNOTICE("Nested exception: " << e.what());
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception: " << e.what());
 		} catch (...) {
-			BEESLOGNOTICE("Nested exception ...");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception ...");
 		}
 		if (!m_next_tracer) {
-			BEESLOGNOTICE("---  END  TRACE --- exception ---");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: ---  END  TRACE --- exception ---");
 		}
 	}
 	tl_next_tracer = m_next_tracer;
@@ -49,7 +43,7 @@ BeesTracer::~BeesTracer()
 	}
 }

-BeesTracer::BeesTracer(function<void()> f, bool silent) :
+BeesTracer::BeesTracer(const function<void()> &f, bool silent) :
 	m_func(f)
 {
 	m_next_tracer = tl_next_tracer;
@@ -61,12 +55,12 @@ void
 BeesTracer::trace_now()
 {
 	BeesTracer *tp = tl_next_tracer;
-	BEESLOGNOTICE("--- BEGIN TRACE ---");
+	BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE ---");
 	while (tp) {
 		tp->m_func();
 		tp = tp->m_next_tracer;
 	}
-	BEESLOGNOTICE("---  END  TRACE ---");
+	BEESLOG(BEES_TRACE_LEVEL, "TRACE: ---  END  TRACE ---");
 }

 bool
@@ -91,9 +85,9 @@ BeesNote::~BeesNote()
 	tl_next = m_prev;
 	unique_lock<mutex> lock(s_mutex);
 	if (tl_next) {
-		s_status[crucible::gettid()] = tl_next;
+		s_status[gettid()] = tl_next;
 	} else {
-		s_status.erase(crucible::gettid());
+		s_status.erase(gettid());
 	}
 }

@@ -104,7 +98,7 @@ BeesNote::BeesNote(function<void(ostream &os)> f) :
 	m_prev = tl_next;
 	tl_next = this;
 	unique_lock<mutex> lock(s_mutex);
-	s_status[crucible::gettid()] = tl_next;
+	s_status[gettid()] = tl_next;
 }

 void
@@ -183,6 +183,24 @@ BeesFileRange::grow_begin(off_t delta)
 	return m_begin;
 }

+off_t
+BeesFileRange::shrink_begin(off_t delta)
+{
+	THROW_CHECK1(invalid_argument, delta, delta > 0);
+	THROW_CHECK3(invalid_argument, delta, m_begin, m_end, delta + m_begin < m_end);
+	m_begin += delta;
+	return m_begin;
+}
+
+off_t
+BeesFileRange::shrink_end(off_t delta)
+{
+	THROW_CHECK1(invalid_argument, delta, delta > 0);
+	THROW_CHECK2(invalid_argument, delta, m_end, m_end >= delta);
+	m_end -= delta;
+	return m_end;
+}
+
 BeesFileRange::BeesFileRange(const BeesBlockData &bbd) :
 	m_fd(bbd.fd()),
 	m_begin(bbd.begin()),
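The range-shrinking helpers added above can be illustrated with a small standalone type. This is a sketch under stated assumptions: `Range` is a hypothetical simplification of `BeesFileRange` with no file descriptor, plain exceptions replace the `THROW_CHECK` macros, and `shrink_end` here uses a stricter bound (the range must stay non-empty) than the check shown in the diff.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical simplified range: half-open interval [begin, end).
struct Range {
	std::int64_t begin;
	std::int64_t end;

	std::int64_t size() const { return end - begin; }

	// Move the start of the range forward by delta bytes.
	std::int64_t shrink_begin(std::int64_t delta) {
		if (delta <= 0) throw std::invalid_argument("delta must be positive");
		if (delta + begin >= end) throw std::invalid_argument("delta would empty the range");
		begin += delta;
		return begin;
	}

	// Move the end of the range backward by delta bytes.
	// Assumption: reject any delta that would leave an empty range.
	std::int64_t shrink_end(std::int64_t delta) {
		if (delta <= 0) throw std::invalid_argument("delta must be positive");
		if (delta >= size()) throw std::invalid_argument("delta would empty the range");
		end -= delta;
		return end;
	}
};
```

A pair of ranges shrunk by the same delta keeps equal sizes, which is the invariant `BeesRangePair::shrink_begin` and `shrink_end` assert after delegating to each member.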
@@ -349,8 +367,8 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 	BEESTRACE("e_second " << e_second);

 	// Preread entire extent
-	bees_readahead(second.fd(), e_second.begin(), e_second.size());
-	bees_readahead(first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());
+	bees_readahead_pair(second.fd(), e_second.begin(), e_second.size(),
+			    first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());

 	auto hash_table = ctx->hash_table();

@@ -388,17 +406,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			break;
 		}

-		// Source extent cannot be toxic
-		BeesAddress first_addr(first.fd(), new_first.begin());
-		if (!first_addr.is_magic()) {
-			auto first_resolved = ctx->resolve_addr(first_addr);
-			if (first_resolved.is_toxic()) {
-				BEESLOGWARN("WORKAROUND: not growing matching pair backward because src addr is toxic:\n" << *this);
-				BEESCOUNT(pairbackward_toxic_addr);
-				break;
-			}
-		}
-
 		// Extend second range.  If we hit BOF we can go no further.
 		BeesFileRange new_second = second;
 		BEESTRACE("new_second = " << new_second);
@@ -434,6 +441,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 		}

 		// Source block cannot be zero in a non-compressed non-magic extent
+		BeesAddress first_addr(first.fd(), new_first.begin());
 		if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
 			BEESCOUNT(pairbackward_zero);
 			break;
@@ -449,7 +457,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
 			BEESCOUNT(pairbackward_toxic_hash);
 			break;
 		}
@@ -491,17 +499,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			break;
 		}

-		// Source extent cannot be toxic
-		BeesAddress first_addr(first.fd(), new_first.begin());
-		if (!first_addr.is_magic()) {
-			auto first_resolved = ctx->resolve_addr(first_addr);
-			if (first_resolved.is_toxic()) {
-				BEESLOGWARN("WORKAROUND: not growing matching pair forward because src is toxic:\n" << *this);
-				BEESCOUNT(pairforward_toxic);
-				break;
-			}
-		}
-
 		// Extend second range.  If we hit EOF we can go no further.
 		BeesFileRange new_second = second;
 		BEESTRACE("new_second = " << new_second);
@@ -545,6 +542,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 		}

 		// Source block cannot be zero in a non-compressed non-magic extent
+		BeesAddress first_addr(first.fd(), new_first.begin());
 		if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
 			BEESCOUNT(pairforward_zero);
 			break;
@@ -560,7 +558,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
 			BEESCOUNT(pairforward_toxic_hash);
 			break;
 		}
@@ -574,7 +572,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 	}

 	if (first.overlaps(second)) {
-		BEESLOGTRACE("after grow, first " << first << "\n\toverlaps " << second);
+		BEESLOGDEBUG("after grow, first " << first << "\n\toverlaps " << second);
 		BEESCOUNT(bug_grow_pair_overlaps);
 	}

@@ -589,6 +587,22 @@ BeesRangePair::copy_closed() const
 	return BeesRangePair(first.copy_closed(), second.copy_closed());
 }

+void
+BeesRangePair::shrink_begin(off_t const delta)
+{
+	first.shrink_begin(delta);
+	second.shrink_begin(delta);
+	THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
+}
+
+void
+BeesRangePair::shrink_end(off_t const delta)
+{
+	first.shrink_end(delta);
+	second.shrink_end(delta);
+	THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
+}
+
 ostream &
 operator<<(ostream &os, const BeesAddress &ba)
 {
@@ -660,7 +674,7 @@ BeesAddress::magic_check(uint64_t flags)
 	static const unsigned recognized_flags = compressed_flags | delalloc_flags | ignore_flags | unusable_flags;

 	if (flags & ~recognized_flags) {
-		BEESLOGTRACE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
+		BEESLOGNOTICE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
 		m_addr = UNUSABLE;
 		// maybe we throw here?
 		BEESCOUNT(addr_unrecognized);
@@ -12,9 +12,10 @@ Load management options:
    -C, --thread-factor   Worker thread factor (default 1)
    -G, --thread-min      Minimum worker thread count (default 0)
    -g, --loadavg-target  Target load average for worker threads (default none)
+        --throttle-factor Idle time between operations (default 0.0)

 Filesystem tree traversal options:
-    -m, --scan-mode       Scanning mode (0..2, default 0)
+    -m, --scan-mode       Scanning mode (0..4, default 4)

 Workarounds:
    -a, --workaround-btrfs-send    Workaround for btrfs send
@@ -4,6 +4,7 @@
 #include "crucible/process.h"
 #include "crucible/string.h"
 #include "crucible/task.h"
+#include "crucible/uname.h"

 #include <cctype>
 #include <cmath>
@@ -11,17 +12,19 @@

 #include <iostream>
 #include <memory>
+#include <regex>
 #include <sstream>

 // PRIx64
 #include <inttypes.h>

-#include <sched.h>
-#include <sys/fanotify.h>
-
 #include <linux/fs.h>
 #include <sys/ioctl.h>

+// statfs
+#include <linux/magic.h>
+#include <sys/statfs.h>
+
 // setrlimit
 #include <sys/time.h>
 #include <sys/resource.h>
@@ -198,7 +201,7 @@ BeesTooLong::check() const
 	if (age() > m_limit) {
 		ostringstream oss;
 		m_func(oss);
-		BEESLOGWARN("PERFORMANCE: " << *this << " sec: " << oss.str());
+		BEESLOGINFO("PERFORMANCE: " << *this << " sec: " << oss.str());
 	}
 }

@@ -214,21 +217,41 @@ BeesTooLong::operator=(const func_type &f)
 	return *this;
 }

-void
-bees_readahead(int const fd, const off_t offset, const size_t size)
+static
+bool
+bees_readahead_check(int const fd, off_t const offset, size_t const size)
 {
+	// FIXME: the rest of the code calls this function more often than necessary,
+	// usually back-to-back calls on the same range in a loop.
+	// Simply discard requests that are identical to recent requests.
+	const Stat stat_rv(fd);
+	auto tup = make_tuple(offset, size, stat_rv.st_dev, stat_rv.st_ino);
+	static mutex s_recent_mutex;
+	static set<decltype(tup)> s_recent;
+	unique_lock<mutex> lock(s_recent_mutex);
+	if (s_recent.size() > BEES_MAX_EXTENT_REF_COUNT) {
+		s_recent.clear();
+		BEESCOUNT(readahead_clear);
+	}
+	const auto rv = s_recent.insert(tup);
+	// If we recently did this readahead, we're done here
+	if (!rv.second) {
+		BEESCOUNT(readahead_skip);
+	}
+	return rv.second;
+}
+
+static
+void
+bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
+{
+	if (!bees_readahead_check(fd, offset, size)) return;
 	Timer readahead_timer;
 	BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 	BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
-#if 0
-	// In the kernel, readahead() is identical to posix_fadvise(..., POSIX_FADV_DONTNEED)
-	DIE_IF_NON_ZERO(readahead(fd, offset, size));
-#else
 	// Make sure this data is in page cache by brute force
-	// This isn't necessary and it might even be slower,
-	// but the btrfs kernel code does readahead with lower ioprio
-	// and might discard the readahead request entirely,
-	// so it's maybe, *maybe*, worth doing both.
+	// The btrfs kernel code does readahead with lower ioprio
+	// and might discard the readahead request entirely.
 	BEESNOTE("emulating readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 	auto working_size = size;
 	auto working_offset = offset;
@@ -239,16 +262,41 @@ bees_readahead(int const fd, const off_t offset, const size_t size)
 		// Ignore errors and short reads.  It turns out our size
 		// parameter isn't all that accurate, so we can't use
 		// the pread_or_die template.
-		(void)!pread(fd, dummy, this_read_size, working_offset);
-		BEESCOUNT(readahead_count);
-		BEESCOUNTADD(readahead_bytes, this_read_size);
+		const auto pr_rv = pread(fd, dummy, this_read_size, working_offset);
+		if (pr_rv >= 0) {
+			BEESCOUNT(readahead_count);
+			BEESCOUNTADD(readahead_bytes, pr_rv);
+		} else {
+			BEESCOUNT(readahead_fail);
+		}
 		working_offset += this_read_size;
 		working_size -= this_read_size;
 	}
-#endif
 	BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
 }

+static mutex s_only_one;
+
+void
+bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offset2, size_t size2)
+{
+	if (!bees_readahead_check(fd, offset, size) && !bees_readahead_check(fd2, offset2, size2)) return;
+	BEESNOTE("waiting to readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size) << ","
+		<< "\n\t" << name_fd(fd2) << " offset " << to_hex(offset2) << " len " << pretty(size2));
+	unique_lock<mutex> m_lock(s_only_one);
+	bees_readahead_nolock(fd, offset, size);
+	bees_readahead_nolock(fd2, offset2, size2);
+}
+
+void
+bees_readahead(int const fd, const off_t offset, const size_t size)
+{
+	if (!bees_readahead_check(fd, offset, size)) return;
+	BEESNOTE("waiting to readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
+	unique_lock<mutex> m_lock(s_only_one);
+	bees_readahead_nolock(fd, offset, size);
+}
+
 void
 bees_unreadahead(int const fd, off_t offset, size_t size)
 {
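The "discard requests identical to recent requests" idea behind `bees_readahead_check` can be sketched as a small class. This is an illustration only: `RecentRequests` and its `limit` parameter are hypothetical names, and the device and inode numbers are passed in directly instead of being read with `fstat`. The set is wiped when it grows past the limit, which trades exactness for bounded memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <set>
#include <tuple>

// Hypothetical helper: remembers recent (offset, size, dev, ino) requests.
class RecentRequests {
public:
	explicit RecentRequests(std::size_t limit) : m_limit(limit) {}

	// Returns true if the request was not seen recently, i.e. the caller
	// should go ahead and do the work.  Returns false for a repeat.
	bool check(std::int64_t offset, std::size_t size, std::uint64_t dev, std::uint64_t ino) {
		std::unique_lock<std::mutex> lock(m_mutex);
		if (m_recent.size() > m_limit) {
			// Forget everything rather than track request ages
			m_recent.clear();
		}
		return m_recent.insert(std::make_tuple(offset, size, dev, ino)).second;
	}

private:
	using key_type = std::tuple<std::int64_t, std::size_t, std::uint64_t, std::uint64_t>;
	std::size_t m_limit;
	std::mutex m_mutex;
	std::set<key_type> m_recent;
};
```

Note that each call records the request as a side effect, so a caller that checks the same range twice in a row will see the second call report a repeat.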
@@ -259,6 +307,48 @@ bees_unreadahead(int const fd, off_t offset, size_t size)
 	BEESCOUNTADD(readahead_unread_ms, unreadahead_timer.age() * 1000);
 }

+static double bees_throttle_factor = 0.0;
+
+void
+bees_throttle(const double time_used, const char *const context)
+{
+	static mutex s_mutex;
+	unique_lock<mutex> throttle_lock(s_mutex);
+	struct time_pair {
+		double time_used = 0;
+		double time_count = 0;
+		double longest_sleep_time = 0;
+	};
+	static map<string, time_pair> s_time_map;
+	auto &this_time = s_time_map[context];
+	auto &this_time_used = this_time.time_used;
+	auto &this_time_count = this_time.time_count;
+	auto &longest_sleep_time = this_time.longest_sleep_time;
+	this_time_used += time_used;
+	++this_time_count;
+	// Keep the timing data fresh
+	static Timer s_fresh_timer;
+	if (s_fresh_timer.age() > 60) {
+		s_fresh_timer.reset();
+		this_time_count *= 0.9;
+		this_time_used *= 0.9;
+	}
+	// Wait for enough data to calculate rates
+	if (this_time_used < 1.0 || this_time_count < 1.0) return;
+	const auto avg_time = this_time_used / this_time_count;
+	const auto sleep_time = min(60.0, bees_throttle_factor * avg_time - time_used);
+	if (sleep_time <= 0) {
+		return;
+	}
+	if (sleep_time > longest_sleep_time) {
+		BEESLOGDEBUG(context << ": throttle delay " << sleep_time << " s, time used " << time_used << " s, avg time " << avg_time << " s");
+		longest_sleep_time = sleep_time;
+	}
+	throttle_lock.unlock();
+	BEESNOTE(context << ": throttle delay " << sleep_time << " s, time used " << time_used << " s, avg time " << avg_time << " s");
+	nanosleep(sleep_time);
+}
+
 thread_local random_device bees_random_device;
 thread_local uniform_int_distribution<default_random_engine::result_type> bees_random_seed_dist(
 	numeric_limits<default_random_engine::result_type>::min(),
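The delay computed by `bees_throttle` reduces to a short formula: sleep for the throttle factor times the average operation time, minus the time the current operation already used, capped at 60 seconds. A minimal sketch of just that arithmetic follows; `throttle_sleep_time` is a hypothetical helper name, and the running-average bookkeeping and locking from the diff are left out.

```cpp
#include <algorithm>

// Hypothetical helper: how long to sleep after an operation that took
// time_used seconds, given the running average and the throttle factor.
// Returns 0 when no delay is needed.
double throttle_sleep_time(double factor, double avg_time, double time_used)
{
	const double max_sleep = 60.0;
	const double sleep_time = std::min(max_sleep, factor * avg_time - time_used);
	return std::max(0.0, sleep_time);
}
```

With a factor of 0 the result is never positive, so throttling is effectively disabled; with a factor of 1.0 an operation that finishes faster than average sleeps for the remainder of the average time.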
@@ -304,6 +394,73 @@ BeesStringFile::read()
 	return read_string(fd, st.st_size);
 }

+static
+void
+bees_fsync(int const fd)
+{
+
+	// Note that when btrfs renames a temporary over an existing file,
+	// it flushes the temporary, so we get the right behavior if we
+	// just do nothing here (except when the file is first created;
+	// however, in that case the result is the same as if the file
+	// did not exist, was empty, or was filled with garbage).
+	//
+	// Kernel versions prior to 5.16 had bugs which would put ghost
+	// dirents in $BEESHOME if there was a crash when we called
+	// fsync() here.
+	//
+	// Some other filesystems will throw our data away if we don't
+	// call fsync, so we do need to call fsync() on those filesystems.
+	//
+	// Newer btrfs kernel versions rely on fsync() to report
+	// unrecoverable write errors.  If we don't check the fsync()
+	// result, we'll lose the data when we rename().  Kernel 6.2 added
+	// a number of new root causes for the class of "unrecoverable
+	// write errors" so we need to check this now.
+
+	BEESNOTE("checking filesystem type for " << name_fd(fd));
+	// LSB deprecated statfs without providing a replacement that
+	// can fill in the f_type field.
+	struct statfs stf = { 0 };
+	DIE_IF_NON_ZERO(fstatfs(fd, &stf));
+	if (static_cast<decltype(BTRFS_SUPER_MAGIC)>(stf.f_type) != BTRFS_SUPER_MAGIC) {
+		BEESLOGONCE("Using fsync on non-btrfs filesystem type " << to_hex(stf.f_type));
+		BEESNOTE("fsync non-btrfs " << name_fd(fd));
+		DIE_IF_NON_ZERO(fsync(fd));
+		return;
+	}
+
+	static bool did_uname = false;
+	static bool do_fsync = false;
+
+	if (!did_uname) {
+		Uname uname;
+		const string version(uname.release);
+		static const regex version_re(R"/(^(\d+)\.(\d+)\.)/", regex::optimize | regex::ECMAScript);
+		smatch m;
+		// Last known bug in the fsync-rename use case was fixed in kernel 5.16
+		static const auto min_major = 5, min_minor = 16;
+		if (regex_search(version, m, version_re)) {
+			const auto major = stoul(m[1]);
+			const auto minor = stoul(m[2]);
+			if (tie(major, minor) >= tie(min_major, min_minor)) {
+				BEESLOGONCE("Using fsync on btrfs because kernel version is " << major << "." << minor);
+				do_fsync = true;
+			} else {
+				BEESLOGONCE("Not using fsync on btrfs because kernel version is " << major << "." << minor);
+			}
+		} else {
+			BEESLOGONCE("Not using fsync on btrfs because can't parse kernel version '" << version << "'");
+		}
+		did_uname = true;
+	}
+
+	if (do_fsync) {
+		BEESNOTE("fsync btrfs " << name_fd(fd));
+		DIE_IF_NON_ZERO(fsync(fd));
+	}
+}
+
 void
 BeesStringFile::write(string contents)
 {
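The kernel version gate in `bees_fsync` parses the leading `major.minor.` of the release string and compares it numerically as a pair. The following is a minimal sketch of that technique, not the code from the diff: `kernel_at_least` is a hypothetical helper, it takes the release string as a parameter instead of calling `uname`, and it uses an at-least comparison chosen for this illustration. An unparseable string is treated as "too old", matching the conservative fallback in the diff.

```cpp
#include <regex>
#include <string>
#include <tuple>

// Hypothetical helper: true if the release string starts with a
// "major.minor." version that is at least min_major.min_minor.
bool kernel_at_least(const std::string &release, unsigned long min_major, unsigned long min_minor)
{
	static const std::regex version_re(R"(^(\d+)\.(\d+)\.)");
	std::smatch m;
	if (!std::regex_search(release, m, version_re)) {
		// Can't parse the version, so assume the feature is unavailable
		return false;
	}
	const unsigned long ver_major = std::stoul(m[1].str());
	const unsigned long ver_minor = std::stoul(m[2].str());
	// Tuple comparison is lexicographic: major first, then minor
	return std::make_tuple(ver_major, ver_minor) >= std::make_tuple(min_major, min_minor);
}
```

Comparing parsed integers rather than strings matters here: `5.4` must sort below `5.16`, which a plain string comparison would get wrong.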
@@ -319,19 +476,8 @@ BeesStringFile::write(string contents)
 		Fd ofd = openat_or_die(m_dir_fd, tmpname, FLAGS_CREATE_FILE, S_IRUSR | S_IWUSR);
 		BEESNOTE("writing " << tmpname << " in " << name_fd(m_dir_fd));
 		write_or_die(ofd, contents);
-#if 0
-		// This triggers too many btrfs bugs.  I wish I was kidding.
-		// Forget snapshots, balance, compression, and dedupe:
-		// the system call you have to fear on btrfs is fsync().
-		// Also note that when bees renames a temporary over an
-		// existing file, it flushes the temporary, so we get
-		// the right behavior if we just do nothing here
-		// (except when the file is first created; however,
-		// in that case the result is the same as if the file
-		// did not exist, was empty, or was filled with garbage).
 		BEESNOTE("fsyncing " << tmpname << " in " << name_fd(m_dir_fd));
-		DIE_IF_NON_ZERO(fsync(ofd));
-#endif
+		bees_fsync(ofd);
 	}
 	BEESNOTE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
 	BEESTRACE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
@@ -355,6 +501,25 @@ BeesTempFile::resize(off_t offset)

 	// Count time spent here
 	BEESCOUNTADD(tmp_resize_ms, resize_timer.age() * 1000);
+
+	// Modify flags - every time
+	// - btrfs will keep trying to set FS_NOCOMP_FL behind us when compression heuristics identify
+	//   the data as compressible, but it fails to compress
+	// - clear FS_NOCOW_FL because we can only dedupe between files with the same FS_NOCOW_FL state,
+	//   and we don't open FS_NOCOW_FL files for dedupe.
+	BEESTRACE("Getting FS_COMPR_FL and FS_NOCOMP_FL on m_fd " << name_fd(m_fd));
+	int flags = ioctl_iflags_get(m_fd);
+	const auto orig_flags = flags;
+
+	flags |= FS_COMPR_FL;
+	flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
+	if (flags != orig_flags) {
+		BEESTRACE("Setting FS_COMPR_FL and clearing FS_NOCOMP_FL | FS_NOCOW_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
+		ioctl_iflags_set(m_fd, flags);
+	}
+
+	// That may have queued some delayed ref deletes, so throttle them
+	bees_throttle(resize_timer.age(), "tmpfile_resize");
 }

 void
@@ -395,13 +560,6 @@ BeesTempFile::BeesTempFile(shared_ptr<BeesContext> ctx) :
 	// Add this file to open_root_ino lookup table
 	m_roots->insert_tmpfile(m_fd);

-	// Set compression attribute
-	BEESTRACE("Getting FS_COMPR_FL on m_fd " << name_fd(m_fd));
-	int flags = ioctl_iflags_get(m_fd);
-	flags |= FS_COMPR_FL;
-	BEESTRACE("Setting FS_COMPR_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
-	ioctl_iflags_set(m_fd, flags);
-
 	// Count time spent here
 	BEESCOUNTADD(tmp_create_ms, create_timer.age() * 1000);

@@ -490,6 +648,8 @@ BeesTempFile::make_copy(const BeesFileRange &src)
 	}
 	BEESCOUNTADD(tmp_copy_ms, copy_timer.age() * 1000);

+	bees_throttle(copy_timer.age(), "tmpfile_copy");
+
 	BEESCOUNT(tmp_copy);
 	return rv;
 }
@@ -528,19 +688,23 @@ operator<<(ostream &os, const siginfo_t &si)

 static sigset_t new_sigset, old_sigset;

+static
 void
-block_term_signal()
+block_signals()
 {
 	BEESLOGDEBUG("Masking signals");

 	DIE_IF_NON_ZERO(sigemptyset(&new_sigset));
 	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGTERM));
 	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGINT));
+	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGUSR1));
+	DIE_IF_NON_ZERO(sigaddset(&new_sigset, SIGUSR2));
 	DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &new_sigset, &old_sigset));
 }

+static
 void
-wait_for_term_signal()
+wait_for_signals()
 {
 	BEESNOTE("waiting for signals");
 	BEESLOGDEBUG("Waiting for signals...");
@@ -557,14 +721,28 @@ wait_for_term_signal()
 			THROW_ERRNO("sigwaitinfo errno = " << errno);
 		} else {
 			BEESLOGNOTICE("Received signal " << rv << " info " << info);
-			// Unblock so we die immediately if signalled again
-			DIE_IF_NON_ZERO(sigprocmask(SIG_BLOCK, &old_sigset, &new_sigset));
-			break;
+			// If SIGTERM or SIGINT, unblock so we die immediately if signalled again
+			switch (info.si_signo) {
+				case SIGUSR1:
+					BEESLOGNOTICE("Received SIGUSR1 - pausing workers");
+					TaskMaster::pause(true);
+					break;
+				case SIGUSR2:
+					BEESLOGNOTICE("Received SIGUSR2 - unpausing workers");
+					TaskMaster::pause(false);
+					break;
+				case SIGTERM:
+				case SIGINT:
+				default:
+					DIE_IF_NON_ZERO(sigprocmask(SIG_SETMASK, &old_sigset, &new_sigset));
+					BEESLOGDEBUG("Signal catcher exiting");
+					return;
+			}
 		}
 	}
-	BEESLOGDEBUG("Signal catcher exiting");
 }

+static
 int
 bees_main(int argc, char *argv[])
 {
@@ -573,7 +751,7 @@ bees_main(int argc, char *argv[])
 			BEESLOGDEBUG("exception (ignored): " << s);
 			BEESCOUNT(exception_caught_silent);
 		} else {
-			BEESLOGNOTICE("\n\n*** EXCEPTION ***\n\t" << s << "\n***\n");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: EXCEPTION: " << s);
 			BEESCOUNT(exception_caught);
 		}
 	});
@@ -588,47 +766,51 @@ bees_main(int argc, char *argv[])

 	// Have to block signals now before we create a bunch of threads
 	// so the threads will also have the signals blocked.
-	block_term_signal();
+	block_signals();

 	// Create a context so we can apply configuration to it
 	shared_ptr<BeesContext> bc = make_shared<BeesContext>();
 	BEESLOGDEBUG("context constructed");

-	string cwd(readlink_or_die("/proc/self/cwd"));
-
 	// Defaults
+	bool use_relative_paths = false;
 	bool chatter_prefix_timestamp = true;
 	double thread_factor = 0;
 	unsigned thread_count = 0;
 	unsigned thread_min = 0;
 	double load_target = 0;
 	bool workaround_btrfs_send = false;
-	BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_INDEPENDENT;
+	BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_EXTENT;

 	// Configure getopt_long
+	// Options with no short form
+	enum {
+		BEES_OPT_THROTTLE_FACTOR = 256,
+	};
 	static const struct option long_options[] = {
-		{ "thread-factor",         required_argument, NULL, 'C' },
-		{ "thread-min",            required_argument, NULL, 'G' },
-		{ "strip-paths",           no_argument,       NULL, 'P' },
-		{ "no-timestamps",         no_argument,       NULL, 'T' },
-		{ "workaround-btrfs-send", no_argument,       NULL, 'a' },
-		{ "thread-count",          required_argument, NULL, 'c' },
-		{ "loadavg-target",        required_argument, NULL, 'g' },
-		{ "help",                  no_argument,       NULL, 'h' },
-		{ "scan-mode",             required_argument, NULL, 'm' },
-		{ "absolute-paths",        no_argument,       NULL, 'p' },
-		{ "timestamps",            no_argument,       NULL, 't' },
-		{ "verbose",               required_argument, NULL, 'v' },
-		{ 0, 0, 0, 0 },
+		{ .name = "thread-factor",         .has_arg = required_argument, .val = 'C' },
+		{ .name = "throttle-factor",       .has_arg = required_argument, .val = BEES_OPT_THROTTLE_FACTOR },
+		{ .name = "thread-min",            .has_arg = required_argument, .val = 'G' },
+		{ .name = "strip-paths",           .has_arg = no_argument,       .val = 'P' },
+		{ .name = "no-timestamps",         .has_arg = no_argument,       .val = 'T' },
+		{ .name = "workaround-btrfs-send", .has_arg = no_argument,       .val = 'a' },
+		{ .name = "thread-count",          .has_arg = required_argument, .val = 'c' },
+		{ .name = "loadavg-target",        .has_arg = required_argument, .val = 'g' },
+		{ .name = "help",                  .has_arg = no_argument,       .val = 'h' },
+		{ .name = "scan-mode",             .has_arg = required_argument, .val = 'm' },
+		{ .name = "absolute-paths",        .has_arg = no_argument,       .val = 'p' },
+		{ .name = "timestamps",            .has_arg = no_argument,       .val = 't' },
+		{ .name = "verbose",               .has_arg = required_argument, .val = 'v' },
+		{ 0 },
 	};

 	// Build getopt_long's short option list from the long_options table.
 	// While we're at it, make sure we didn't duplicate any options.
 	string getopt_list;
-	set<decltype(option::val)> option_vals;
+	map<decltype(option::val), string> option_vals;
 	for (const struct option *op = long_options; op->val; ++op) {
-		THROW_CHECK1(runtime_error, op->val, !option_vals.count(op->val));
-		option_vals.insert(op->val);
+		const auto ins_rv = option_vals.insert(make_pair(op->val, op->name));
+		THROW_CHECK1(runtime_error, op->val, ins_rv.second);
 		if ((op->val & 0xff) != op->val) {
 			continue;
 		}
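Deriving the `getopt_long` short option string from the long option table, while rejecting duplicate option values, can be sketched as below. This is an illustration under assumptions: `build_short_options` is a hypothetical function name, and the part that appends the option character and a `:` for options taking an argument is not visible in the hunk above, so it is a guess at the natural completion. Only `struct option`, `required_argument` and `no_argument` come from `<getopt.h>`.

```cpp
#include <getopt.h>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: builds the short option list for getopt_long from a
// long option table terminated by an all-zero entry.
std::string build_short_options(const struct option *long_options)
{
	std::string short_opts;
	std::map<int, std::string> seen;
	for (const struct option *op = long_options; op->name; ++op) {
		// Each option value may appear only once in the table
		if (!seen.insert(std::make_pair(op->val, op->name)).second) {
			throw std::runtime_error(std::string("duplicate option value for --") + op->name);
		}
		// Values that do not fit in one byte have no short form
		if ((op->val & 0xff) != op->val) continue;
		short_opts += static_cast<char>(op->val);
		if (op->has_arg == required_argument) short_opts += ':';
	}
	return short_opts;
}
```

Keeping the option names in a map, instead of a plain set of values, also lets later code print the long name of whichever option was parsed, including options like `--throttle-factor` that have no printable short character.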
@@ -639,27 +821,31 @@ bees_main(int argc, char *argv[])
 	}

 	// Parse options
-	int c;
 	while (true) {
 		int option_index = 0;

-		c = getopt_long(argc, argv, getopt_list.c_str(), long_options, &option_index);
+		const auto c = getopt_long(argc, argv, getopt_list.c_str(), long_options, &option_index);
 		if (-1 == c) {
 			break;
 		}

-		BEESLOGDEBUG("Parsing option '" << static_cast<char>(c) << "'");
+		// getopt_long should have weeded out any invalid options,
+		// so we can go ahead and throw here
+		BEESLOGDEBUG("Parsing option '" << option_vals.at(c) << "'");

 		switch (c) {

 			case 'C':
 				thread_factor = stod(optarg);
 				break;
+			case BEES_OPT_THROTTLE_FACTOR:
+				bees_throttle_factor = stod(optarg);
+				break;
 			case 'G':
 				thread_min = stoul(optarg);
 				break;
 			case 'P':
-				crucible::set_relative_path(cwd);
+				use_relative_paths = true;
 				break;
 			case 'T':
 				chatter_prefix_timestamp = false;
@@ -677,7 +863,7 @@ bees_main(int argc, char *argv[])
 				root_scan_mode = static_cast<BeesRoots::ScanMode>(stoul(optarg));
 				break;
 			case 'p':
-				crucible::set_relative_path("");
+				use_relative_paths = false;
 				break;
 			case 't':
 				chatter_prefix_timestamp = true;
@@ -695,12 +881,12 @@ bees_main(int argc, char *argv[])
 			case 'h':
 			default:
 				do_cmd_help(argv);
-				return EXIT_FAILURE;
+				return EXIT_SUCCESS;
 		}
 	}

 	if (optind + 1 != argc) {
-		BEESLOGERR("Only one filesystem path per bees process");
+		BEESLOGERR("Exactly one filesystem path required");
 		return EXIT_FAILURE;
 	}

@@ -740,22 +926,32 @@ bees_main(int argc, char *argv[])
 	BEESLOGNOTICE("setting worker thread pool maximum size to " << thread_count);
 	TaskMaster::set_thread_count(thread_count);

+	BEESLOGNOTICE("setting throttle factor to " << bees_throttle_factor);
+
 	// Set root path
 	string root_path = argv[optind++];
 	BEESLOGNOTICE("setting root path to '" << root_path << "'");
 	bc->set_root_path(root_path);

+	// Set path prefix
+	if (use_relative_paths) {
+		crucible::set_relative_path(name_fd(bc->root_fd()));
+	}
+
 	// Workaround for btrfs send
 	bc->roots()->set_workaround_btrfs_send(workaround_btrfs_send);

 	// Set root scan mode
 	bc->roots()->set_scan_mode(root_scan_mode);

+	// Workaround for the logical-ino-vs-clone kernel bug
+	MultiLocker::enable_locking(true);
+
 	// Start crawlers
 	bc->start();

 	// Now we just wait forever
-	wait_for_term_signal();
+	wait_for_signals();

 	// Shut it down
 	bc->stop();
@@ -78,13 +78,13 @@ const int BEES_PROGRESS_INTERVAL = BEES_STATS_INTERVAL;
 const int BEES_STATUS_INTERVAL = 1;

 // Number of file FDs to cache when not in active use
-const size_t BEES_FILE_FD_CACHE_SIZE = 4096;
+const size_t BEES_FILE_FD_CACHE_SIZE = 524288;

 // Number of root FDs to cache when not in active use
-const size_t BEES_ROOT_FD_CACHE_SIZE = 1024;
+const size_t BEES_ROOT_FD_CACHE_SIZE = 65536;

 // Number of FDs to open (rlimit)
-const size_t BEES_OPEN_FILE_LIMIT = (BEES_FILE_FD_CACHE_SIZE + BEES_ROOT_FD_CACHE_SIZE) * 2 + 100;
+const size_t BEES_OPEN_FILE_LIMIT = BEES_FILE_FD_CACHE_SIZE + BEES_ROOT_FD_CACHE_SIZE + 100;

 // Worker thread factor (multiplied by detected number of CPU cores)
 const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
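
Review note: the hunk above raises both cache sizes and drops the `* 2` multiplier from the limit formula, so the requested FD limit still grows substantially. The arithmetic can be checked at compile time with a minimal sketch; the two helper functions are hypothetical and exist only for this illustration, while the numbers are copied from the hunk.

```cpp
#include <cstddef>

// Hypothetical helper: the formula on the removed ('-') line of the hunk.
constexpr std::size_t old_open_file_limit(std::size_t file_cache, std::size_t root_cache) {
	return (file_cache + root_cache) * 2 + 100;
}

// Hypothetical helper: the formula on the added ('+') line of the hunk.
constexpr std::size_t new_open_file_limit(std::size_t file_cache, std::size_t root_cache) {
	return file_cache + root_cache + 100;
}

// Old constants: 4096 file FDs and 1024 root FDs.
static_assert(old_open_file_limit(4096, 1024) == 10340, "old limit");

// New constants: 524288 file FDs and 65536 root FDs.
static_assert(new_open_file_limit(524288, 65536) == 589924, "new limit");
```

Under these values the limit goes from 10340 to 589924 descriptors, so the process presumably needs a correspondingly high `RLIMIT_NOFILE` ceiling to get the full benefit of the larger caches.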
@@ -93,10 +93,11 @@ const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
 const double BEES_TOO_LONG = 5.0;

 // Avoid any extent where LOGICAL_INO takes this much kernel CPU time
-const double BEES_TOXIC_SYS_DURATION = 0.1;
+const double BEES_TOXIC_SYS_DURATION = 5.0;

-// Maximum number of refs to a single extent
-const size_t BEES_MAX_EXTENT_REF_COUNT = (16 * 1024 * 1024 / 24) - 1;
+// Maximum number of refs to a single extent before we have other problems
+// If we have more than 10K refs to an extent, adding another will save 0.01% space
+const size_t BEES_MAX_EXTENT_REF_COUNT = 9999; // (16 * 1024 * 1024 / 24);

 // How long between hash table histograms
 const double BEES_HASH_TABLE_ANALYZE_INTERVAL = BEES_STATS_INTERVAL;
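
Review note: the "0.01%" figure in the new comment can be read as the marginal gain from one more reference relative to the references that already share the extent. The sketch below is one interpretation of that comment, not code from bees; `marginal_saving_percent` is a made-up name.

```cpp
#include <cstddef>

// Hypothetical helper: if `refs` references already share one extent,
// one more reference saves one extra copy, i.e. 1/refs of the space
// those references would have used without dedupe, as a percentage.
constexpr double marginal_saving_percent(std::size_t refs) {
	return 100.0 / static_cast<double>(refs);
}

// The previous limit from the removed line: 16 MiB / 24 bytes, minus one.
constexpr std::size_t old_max_extent_ref_count = (16 * 1024 * 1024 / 24) - 1;
static_assert(old_max_extent_ref_count == 699049, "old limit");

// At 10,000 references the marginal gain is about 0.01%.
static_assert(marginal_saving_percent(10000) > 0.0099, "lower bound");
static_assert(marginal_saving_percent(10000) < 0.0101, "upper bound");
```

By the same reading, the old limit of 699049 references would correspond to a marginal gain of roughly 0.00014% per additional reference.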
@@ -121,9 +122,9 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 // macros ----------------------------------------

 #define BEESLOG(lv,x)   do { if (lv < bees_log_level) { Chatter __chatter(lv, BeesNote::get_name()); __chatter << x; } } while (0)
-#define BEESLOGTRACE(x) do { BEESLOG(LOG_DEBUG, x); BeesTracer::trace_now(); } while (0)

-#define BEESTRACE(x)   BeesTracer  SRSLY_WTF_C(beesTracer_,  __LINE__) ([&]()                 { BEESLOG(LOG_ERR, x);   })
+#define BEES_TRACE_LEVEL LOG_DEBUG
+#define BEESTRACE(x)   BeesTracer  SRSLY_WTF_C(beesTracer_,  __LINE__) ([&]()                 { BEESLOG(BEES_TRACE_LEVEL, "TRACE: " << x << " at " << __FILE__ << ":" << __LINE__);   })
 #define BEESTOOLONG(x) BeesTooLong SRSLY_WTF_C(beesTooLong_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
 #define BEESNOTE(x)    BeesNote    SRSLY_WTF_C(beesNote_,    __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })

@@ -133,6 +134,14 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 #define BEESLOGINFO(x)   BEESLOG(LOG_INFO, x)
 #define BEESLOGDEBUG(x)  BEESLOG(LOG_DEBUG, x)

+#define BEESLOGONCE(__x) do { \
+        static bool already_logged = false; \
+        if (!already_logged) { \
+                already_logged = true; \
+                BEESLOGNOTICE(__x); \
+        } \
+} while (false)
+
 #define BEESCOUNT(stat) do { \
 	BeesStats::s_global.add_count(#stat); \
 } while (0)
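
Review note: `BEESLOGONCE` keeps a `static bool` inside the `do { ... }` block, so each place the macro is expanded gets its own flag and logs at most once for the life of the process. A minimal standalone sketch of the same shape follows; `DEMO_LOG`, `DEMO_LOG_ONCE`, and `demo_log_once` are hypothetical stand-ins that write to a string stream instead of the bees logger.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for BEESLOGNOTICE: appends one line to a stream.
#define DEMO_LOG(os, x) do { (os) << x << '\n'; } while (0)

// Same shape as BEESLOGONCE in the hunk above: one static flag per expansion site.
#define DEMO_LOG_ONCE(os, x) do { \
	static bool already_logged = false; \
	if (!already_logged) { \
		already_logged = true; \
		DEMO_LOG(os, x); \
	} \
} while (false)

// Hits the same expansion site `times` times and returns whatever was
// logged during this call. The flag is static, so it survives between calls.
inline std::string demo_log_once(int times) {
	std::ostringstream os;
	for (int i = 0; i < times; ++i) {
		DEMO_LOG_ONCE(os, "warning " << i);
	}
	return os.str();
}
```

Calling `demo_log_once(3)` for the first time yields a single line for iteration 0; any later call yields an empty string because the flag is already set. As in the macro from the diff, the plain `bool` is not synchronized between threads.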
@@ -184,7 +193,7 @@ class BeesTracer {
 	thread_local static bool tl_silent;
 	thread_local static bool tl_first;
 public:
-	BeesTracer(function<void()> f, bool silent = false);
+	BeesTracer(const function<void()> &f, bool silent = false);
 	~BeesTracer();
 	static void trace_now();
 	static bool get_silent();
@@ -299,6 +308,11 @@ public:
 	off_t grow_begin(off_t delta);
 	/// @}

+	/// @{ Make range smaller
+	off_t shrink_end(off_t delta);
+	off_t shrink_begin(off_t delta);
+	/// @}
+
 friend ostream & operator<<(ostream &os, const BeesFileRange &bfr);
 };

@@ -515,7 +529,7 @@ class BeesCrawl {

 	bool fetch_extents();
 	void fetch_extents_harder();
-	bool next_transid();
+	bool restart_crawl_unlocked();
 	BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;

 public:
@@ -527,6 +541,9 @@ public:
 	BeesCrawlState get_state_end() const;
 	void set_state(const BeesCrawlState &bcs);
 	void deferred(bool def_setting);
+	bool deferred() const;
+	bool finished() const;
+	bool restart_crawl();
 };

 class BeesScanMode;
@@ -535,7 +552,8 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	shared_ptr<BeesContext>			m_ctx;

 	BeesStringFile				m_crawl_state_file;
-	map<uint64_t, shared_ptr<BeesCrawl>>	m_root_crawl_map;
+	using CrawlMap = map<uint64_t, shared_ptr<BeesCrawl>>;
+	CrawlMap				m_root_crawl_map;
 	mutex					m_mutex;
 	uint64_t				m_crawl_dirty = 0;
 	uint64_t				m_crawl_clean = 0;
@@ -554,17 +572,13 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	condition_variable			m_stop_condvar;
 	bool					m_stop_requested = false;

-	void insert_new_crawl();
-	void insert_root(const BeesCrawlState &bcs);
+	CrawlMap insert_new_crawl();
 	Fd open_root_nocache(uint64_t root);
 	Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
-	uint64_t transid_min();
-	uint64_t transid_max();
 	uint64_t transid_max_nocache();
 	void state_load();
 	ostream &state_to_stream(ostream &os);
 	void state_save();
-	bool crawl_roots();
 	string crawl_state_filename() const;
 	void crawl_state_set_dirty();
 	void crawl_state_erase(const BeesCrawlState &bcs);
@@ -572,13 +586,16 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	void writeback_thread();
 	uint64_t next_root(uint64_t root = 0);
 	void current_state_set(const BeesCrawlState &bcs);
-	RateEstimator& transid_re();
 	bool crawl_batch(shared_ptr<BeesCrawl> crawl);
 	void clear_caches();
+	shared_ptr<BeesCrawl> insert_root(const BeesCrawlState &bcs);
+	bool up_to_date(const BeesCrawlState &bcs);

 friend class BeesCrawl;
 friend class BeesFdCache;
 friend class BeesScanMode;
+friend class BeesScanModeSubvol;
+friend class BeesScanModeExtent;

 public:
 	BeesRoots(shared_ptr<BeesContext> ctx);
@@ -594,17 +611,22 @@ public:
 	Fd open_root_ino(const BeesFileId &bfi) { return open_root_ino(bfi.root(), bfi.ino()); }
 	bool is_root_ro(uint64_t root);

-	// TODO:  do extent-tree scans instead
 	enum ScanMode {
 		SCAN_MODE_LOCKSTEP,
 		SCAN_MODE_INDEPENDENT,
 		SCAN_MODE_SEQUENTIAL,
 		SCAN_MODE_RECENT,
+		SCAN_MODE_EXTENT,
 		SCAN_MODE_COUNT, // must be last
 	};

 	void set_scan_mode(ScanMode new_mode);
 	void set_workaround_btrfs_send(bool do_avoid);
+
+	uint64_t transid_min();
+	uint64_t transid_max();
+
+	void wait_for_transid(const uint64_t count);
 };

 struct BeesHash {
@@ -664,6 +686,8 @@ class BeesRangePair : public pair<BeesFileRange, BeesFileRange> {
 public:
 	BeesRangePair(const BeesFileRange &src, const BeesFileRange &dst);
 	bool grow(shared_ptr<BeesContext> ctx, bool constrained);
+	void shrink_begin(const off_t delta);
+	void shrink_end(const off_t delta);
 	BeesRangePair copy_closed() const;
 	bool operator<(const BeesRangePair &that) const;
 friend ostream & operator<<(ostream &os, const BeesRangePair &brp);
@@ -737,11 +761,14 @@ class BeesContext : public enable_shared_from_this<BeesContext> {
 	shared_ptr<BeesThread>				m_progress_thread;
 	shared_ptr<BeesThread>				m_status_thread;

+	mutex						m_progress_mtx;
+	string						m_progress_str;
+
 	void set_root_fd(Fd fd);

 	BeesResolveAddrResult resolve_addr_uncached(BeesAddress addr);

-	BeesFileRange scan_one_extent(const BeesFileRange &bfr, const Extent &e);
+	void scan_one_extent(const BeesFileRange &bfr, const Extent &e);
 	void rewrite_file_range(const BeesFileRange &bfr);

 public:
@@ -772,6 +799,8 @@ public:

 	void dump_status();
 	void show_progress();
+	void set_progress(const string &str);
+	string get_progress();

 	void start();
 	void stop();
@@ -834,7 +863,7 @@ public:
 	BeesFileRange find_one_match(BeesHash hash);

 	void replace_src(const BeesFileRange &src_bfr);
-	BeesFileRange replace_dst(const BeesFileRange &dst_bfr);
+	BeesRangePair replace_dst(const BeesFileRange &dst_bfr);

 	bool found_addr() const { return m_found_addr; }
 	bool found_data() const { return m_found_data; }
@@ -868,7 +897,10 @@ extern const char *BEES_VERSION;
 extern thread_local default_random_engine bees_generator;
 string pretty(double d);
 void bees_readahead(int fd, off_t offset, size_t size);
+void bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offset2, size_t size2);
 void bees_unreadahead(int fd, off_t offset, size_t size);
+void bees_throttle(double time_used, const char *context);
 string format_time(time_t t);
+bool exception_check();

 #endif
@@ -8,6 +8,7 @@ PROGRAMS = \
 	process \
 	progress \
 	seeker \
+	table \
 	task \

 all: test
@@ -19,7 +19,9 @@ seeker_finder(const vector<uint64_t> &vec, uint64_t lower, uint64_t upper)
 	if (ub != s.end()) ++ub;
 	if (ub != s.end()) ++ub;
 	for (; ub != s.end(); ++ub) {
-		if (*ub > upper) break;
+		if (*ub > upper) {
+			break;
+		}
 	}
 	return set<uint64_t>(lb, ub);
 }
@@ -28,7 +30,7 @@ static bool test_fails = false;

 static
 void
-seeker_test(const vector<uint64_t> &vec, uint64_t const target)
+seeker_test(const vector<uint64_t> &vec, uint64_t const target, bool const always_out = false)
 {
 	cerr << "Find " << target << " in {";
 	for (auto i : vec) {
@@ -36,11 +38,13 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 	}
 	cerr << " } = ";
 	size_t loops = 0;
+	tl_seeker_debug_str = make_shared<ostringstream>();
+	bool local_test_fails = false;
 	bool excepted = catch_all([&]() {
-		auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
+		const auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
 			++loops;
 			return seeker_finder(vec, lower, upper);
-		});
+		}, uint64_t(32));
 		cerr << found;
 		uint64_t my_found = 0;
 		for (auto i : vec) {
@@ -52,13 +56,15 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 			cerr << " (correct)";
 		} else {
 			cerr << " (INCORRECT - right answer is " << my_found << ")";
-			test_fails = true;
+			local_test_fails = true;
 		}
 	});
 	cerr << " (" << loops << " loops)" << endl;
-	if (excepted) {
-		test_fails = true;
+	if (excepted || local_test_fails || always_out) {
+		cerr << dynamic_pointer_cast<ostringstream>(tl_seeker_debug_str)->str();
 	}
+	test_fails = test_fails || local_test_fails;
+	tl_seeker_debug_str.reset();
 }

 static
@@ -89,6 +95,39 @@ test_seeker()
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max());
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max() - 1);
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() - 1 }, numeric_limits<uint64_t>::max());
+
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 0);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 1);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 2);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 3);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 4);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 5);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 6);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 7);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 8);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 9);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 1 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 2 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 3 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 4 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 5 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 6 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 7 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 8 );
+
+	// Pulled from a bees debug log
+	seeker_test(vector<uint64_t> {
+		6821962845,
+		6821962848,
+		6821963411,
+		6821963422,
+		6821963536,
+		6821963539,
+		6821963835, // <- appeared during the search, causing an exception
+		6821963841,
+		6822575316,
+	}, 6821971036, true);
 }


@@ -0,0 +1,63 @@
+#include "tests.h"
+
+#include "crucible/table.h"
+
+using namespace crucible;
+using namespace std;
+
+void
+print_table(const Table::Table& t)
+{
+	cerr << "BEGIN TABLE\n";
+	cerr << t;
+	cerr << "END TABLE\n";
+	cerr << endl;
+}
+
+void
+test_table()
+{
+	Table::Table t;
+	t.insert_row(Table::endpos, vector<Table::Content> {
+		Table::Text("Hello, World!"),
+		Table::Text("2"),
+		Table::Text("3"),
+		Table::Text("4"),
+	});
+	print_table(t);
+	t.insert_row(Table::endpos, vector<Table::Content> {
+		Table::Text("Greeting"),
+		Table::Text("two"),
+		Table::Text("three"),
+		Table::Text("four"),
+	});
+	print_table(t);
+	t.insert_row(Table::endpos, vector<Table::Content> {
+		Table::Fill('-'),
+		Table::Text("ii"),
+		Table::Text("iii"),
+		Table::Text("iv"),
+	});
+	print_table(t);
+	t.mid(" | ");
+	t.left("| ");
+	t.right(" |");
+	print_table(t);
+	t.insert_col(1, vector<Table::Content> {
+		Table::Text("1"),
+		Table::Text("one"),
+		Table::Text("i"),
+		Table::Text("I"),
+	});
+	print_table(t);
+	t.at(2, 1) = Table::Text("Two\nLines");
+	print_table(t);
+}
+
+int
+main(int, char**)
+{
+	RUN_A_TEST(test_table());
+
+	exit(EXIT_SUCCESS);
+}