fsync: fix signed comparison of stf.f_type

Build fails on 32-bit Slackware because GCC 11's `-Werror=sign-compare` is stricter than necessary:

	cc -Wall -Wextra -Werror -O3 -I../include -D_FILE_OFFSET_BITS=64 -std=c99 -O2 -march=i586 -mtune=i686 -o bees-version.o -c bees-version.c
	bees.cc: In function 'void bees_fsync(int)':
	bees.cc:426:24: error: comparison of integer expressions of different signedness: '__fsword_t' {aka 'int'} and 'unsigned int' [-Werror=sign-compare]
	  426 | if (stf.f_type != BTRFS_SUPER_MAGIC) {
	      |                ^

To work around this, cast `stf.f_type` to the same type as `BTRFS_SUPER_MAGIC`, so it has the same number of bits that we're looking for in the magic value.

Fixes: https://github.com/Zygo/bees/issues/317
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
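
The cast described in the commit message above can be illustrated with a minimal sketch. This is not the actual bees patch: `kBtrfsSuperMagic`, `fsword32_t`, and `is_btrfs` are hypothetical stand-ins, chosen so the example compiles without Linux headers. It assumes the magic constant has type `unsigned int` and the field is a signed 32-bit integer, as the compiler error reports.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins (not bees code), so the sketch builds on any host:
// kBtrfsSuperMagic mirrors BTRFS_SUPER_MAGIC from <linux/magic.h>, and
// fsword32_t models '__fsword_t' {aka 'int'} on the 32-bit target.
constexpr unsigned int kBtrfsSuperMagic = 0x9123683EU;
using fsword32_t = std::int32_t;

// Cast the signed field to the magic constant's own type before comparing,
// so both operands are unsigned and -Wsign-compare has nothing to flag.
bool is_btrfs(fsword32_t f_type) {
    return static_cast<decltype(kBtrfsSuperMagic)>(f_type) == kBtrfsSuperMagic;
}
```

Using `decltype` on the constant, rather than naming `unsigned int` directly, keeps the comparison correct if the constant's type differs on another platform.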
tempfile: don't need to update the inode if the flags don't change
2025-12-01 09:13:38 +01:00 · 2025-07-03 21:48:40 -04:00 · 2025-06-29 23:34:10 -04:00 · 2025-06-29 23:25:36 -04:00 · 2025-06-29 23:24:55 -04:00 · 2025-06-18 23:06:14 -04:00
35 changed files with 1489 additions and 854 deletions
--- a/README.md
+++ b/README.md
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------

-bees is a block-oriented userspace deduplication agent designed for large
-btrfs filesystems.  It is an offline dedupe combined with an incremental
-data scan capability to minimize time data spends on disk from write
-to dedupe.
+bees is a block-oriented userspace deduplication agent designed to scale
+up to large btrfs filesystems.  It is an offline dedupe combined with
+an incremental data scan capability to minimize time data spends on disk
+from write to dedupe.

 Strengths
 ---------

- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Daemon mode - incrementally dedupes new data as it appears
+ * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
- * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
+ * btrfs support - recovers more free space from btrfs than naive dedupers

 Weaknesses
 ----------

 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
- * First run may require temporary disk space for extent reorganization
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * [First run may increase metadata space usage if many snapshots exist](docs/gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------

 * [bees Gotchas](docs/gotchas.md)
- * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](docs/btrfs-other.md)
 * [What to do when something goes wrong](docs/wrong.md)

@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
--- a/docs/btrfs-kernel.md
+++ b/docs/btrfs-kernel.md
@@ -1,31 +1,24 @@
-Recommended Kernel Version for bees
-===================================
+Recommended Linux Kernel Version for bees
+=========================================

-First, a warning that is not specific to bees:
+First, a warning about old Linux kernel versions:

-> **Kernel 5.1, 5.2, and 5.3 should not be used with btrfs due to a
-severe regression that can lead to fatal metadata corruption.**
-This issue is fixed in kernel 5.4.14 and later.
+> **Linux kernel version 5.1, 5.2, and 5.3 should not be used with btrfs
+due to a severe regression that can lead to fatal metadata corruption.**
+This issue is fixed in version 5.4.14 and later.

-**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
-6.0, or 6.1, with recent LTS and -stable updates.**  The latest released
-kernel as of this writing is 6.4.1.
+**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
+6.6, or 6.12 with recent LTS and -stable updates.**  The latest released
+kernel as of this writing is 6.12.9, and the earliest supported LTS
+kernel is 5.4.

-4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
-issues.  Older kernels will be slower (a little slower or a lot slower
-depending on which issues are triggered).  Not all fixes are backported.
-
-Obsolete non-LTS kernels have a variety of unfixed issues and should
-not be used with btrfs.  For details see the table below.
-
-bees requires btrfs kernel API version 4.2 or higher, and does not work
-at all on older kernels.
-
-Some bees features rely on kernel 4.15 to work, and these features will
-not be available on older kernels.  Currently, bees is still usable on
-older kernels with degraded performance or with options disabled, but
-support for older kernels may be removed.
+Some optional bees features use kernel APIs introduced in kernel 4.15
+(extent scan) and 5.6 (`openat2` support).  These bees features are not
+available on older kernels.  Support for older kernels may be removed
+in a future bees release.

+bees will not run at all on kernels before 4.2 due to lack of minimal
+API support.



@@ -62,6 +55,7 @@ These bugs are particularly popular among bees users, though not all are specifi
 | 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
 | - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
 | - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
+| 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount.  Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
 | 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
 | - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
 | - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
@@ -71,7 +65,7 @@ These bugs are particularly popular among bees users, though not all are specifi
 | 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later.  Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
 | 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
 | 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
-| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that
+| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that

 "Last bad kernel" refers to that version's last stable update from
 kernel.org.  Distro kernels may backport additional fixes.  Consult
@@ -97,12 +91,12 @@ contains the last committed component of the fix.
 Workarounds for known kernel bugs
 ---------------------------------

-* **Hangs with concurrent `LOGICAL_INO` and dedupe**:  on all
-  kernel versions so far, multiple threads running `LOGICAL_INO`
-  and dedupe ioctls at the same time on the same inodes or extents
+* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**:  on all
+  kernel versions so far, multiple threads running `LOGICAL_INO` and
+  dedupe/clone ioctls at the same time on the same inodes or extents
  can lead to a kernel hang.  The kernel enters an infinite loop in
  `add_all_parents`, where `count` is 0, `ref->count` is 1, and
-  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref).
+  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.

  bees has two workarounds for this bug: 1. schedule work so that multiple
  threads do not simultaneously access the same inode or the same extent,
@@ -123,58 +117,32 @@ Workarounds for known kernel bugs

  It is still theoretically possible to trigger the kernel bug when
  running bees at the same time as other dedupers, or other programs
-  that use `LOGICAL_INO` like `btdu`; however, it's extremely difficult
-  to reproduce the bug without closely cooperating threads.
+  that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
+  operation such as `cp` or `mv`; however, it's extremely difficult to
+  reproduce the bug without closely cooperating threads.

-* **Slow backrefs** (aka toxic extents):  Under certain conditions,
-  if the number of references to a single shared extent grows too
-  high, the kernel consumes more and more CPU while also holding locks
-  that delay write access to the filesystem.  bees avoids this bug
-  by measuring the time the kernel spends performing `LOGICAL_INO`
-  operations and permanently blacklisting any extent or hash involved
-  where the kernel starts to get slow.  In the bees log, such blocks
-  are labelled as 'toxic' hash/block addresses.  Toxic extents are
-  rare (about 1 in 100,000 extents become toxic), but toxic extents can
-  become 8 orders of magnitude more expensive to process than the fastest
-  non-toxic extents.  This seems to affect all dedupe agents on btrfs;
-  at this time of writing only bees has a workaround for this bug.
+* **Slow backrefs** (aka toxic extents):  On older kernels, under certain
+  conditions, if the number of references to a single shared extent grows
+  too high, the kernel consumes more and more CPU while also holding
+  locks that delay write access to the filesystem.  This is no longer
+  a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
+  but there are still some remains of earlier workarounds for this issue
+  in bees that have not been fully removed.

-  This workaround is less necessary for kernels 5.4.96, 5.7 and later,
-  though the bees workaround can still be triggered on newer kernels
-  by changes in btrfs since kernel version 5.1.
+  bees avoided this bug by measuring the time the kernel spends performing
+  `LOGICAL_INO` operations and permanently blacklisting any extent or
+  hash involved where the kernel starts to get slow.  In the bees log,
+  such blocks are labelled as 'toxic' hash/block addresses.
+
+  Future bees releases will remove toxic extent detection (it only detects
+  false positives now) and clear all previously saved toxic extent bits.

 * **dedupe breaks `btrfs send` in old kernels**.  The bees option
  `--workaround-btrfs-send` prevents any modification of read-only subvols
-  in order to avoid breaking `btrfs send`.
+  in order to avoid breaking `btrfs send` on kernels before 5.2.

-  This workaround is no longer necessary to avoid kernel crashes
-  and send performance failure on kernel 4.9.207, 4.14.159, 4.19.90,
-  5.3.17, 5.4.4, 5.5 and later; however, some conflict between send
-  and dedupe still remains, so the workaround is still useful.
+  This workaround is no longer necessary to avoid kernel crashes and
+  send performance failure on kernel 5.4.4 and later.  bees will pause
+  dedupe until the send is finished on current kernels.

  `btrfs receive` is not and has never been affected by this issue.
-
-Unfixed kernel bugs
-------------------
-
-* **The kernel does not permit `btrfs send` and dedupe to run at the
-  same time**.  Recent kernels no longer crash, but now refuse one
-  operation with an error if the other operation was already running.
-
-  bees has not been updated to handle the new dedupe behavior optimally.
-  Optimal behavior is to defer dedupe operations when send is detected,
-  and resume after the send is finished.  Current bees behavior is to
-  complain loudly about each individual dedupe failure in log messages,
-  and abandon duplicate data references in the snapshot that send is
-  processing.  A future bees version shall have better handling for
-  this situation.
-
-  Workaround:  send `SIGSTOP` to bees, or terminate the bees process,
-  before running `btrfs send`.
-
-  This workaround is not strictly required if snapshot is deleted after
-  sending.  In that case, any duplicate data blocks that were not removed
-  by dedupe will be removed by snapshot delete instead.  The workaround
-  still saves some IO.
-
-  `btrfs receive` is not affected by this issue.
--- a/docs/btrfs-other.md
+++ b/docs/btrfs-other.md
@@ -3,40 +3,34 @@ Good Btrfs Feature Interactions

 bees has been tested in combination with the following:

-* btrfs compression (zlib, lzo, zstd), mixtures of compressed and uncompressed extents
+* btrfs compression (zlib, lzo, zstd)
 * PREALLOC extents (unconditionally replaced with holes)
 * HOLE extents and btrfs no-holes feature
-* Other deduplicators, reflink copies (though bees may decide to redo their work)
-* btrfs snapshots and non-snapshot subvols (RW and RO)
+* Other deduplicators (`duperemove`, `jdupes`)
+* Reflink copies (modern coreutils `cp` and `mv`)
 * Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
-* All btrfs RAID profiles
-* IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
-* Filesystems mounted with or without the `flushoncommit` option
+* All btrfs RAID profiles:  single, dup, raid0, raid1, raid10, raid1c3, raid1c4, raid5, raid6
+* IO errors during dedupe (affected extents are skipped)
 * 4K filesystem data block size / clone alignment
 * 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
 * Large files (kernel 5.4 or later strongly recommended)
-* Filesystems up to 90T+ bytes, 1000M+ files
+* Filesystem data sizes up to 100T+ bytes, 1000M+ files
+* `open(O_DIRECT)` (seems to work as well--or as poorly--with bees as with any other btrfs feature)
+* btrfs-convert from ext2/3/4
+* btrfs `autodefrag` mount option
+* btrfs balance (data balances cause rescan of relocated data)
+* btrfs block-group-tree
+* btrfs `flushoncommit` and `noflushoncommit` mount options
+* btrfs mixed block groups
+* btrfs `nodatacow`/`nodatasum` inode attribute or mount option (bees skips all nodatasum files)
+* btrfs qgroups and quota support (_not_ squotas)
 * btrfs receive
-* btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
-* open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)
-* lvm dm-cache, writecache
+* btrfs scrub
+* btrfs send (dedupe pauses automatically, kernel 5.4 or later required)
+* btrfs snapshot, non-snapshot subvols (RW and RO), snapshot delete

-Bad Btrfs Feature Interactions
------------------------------
-
-bees has been tested in combination with the following, and various problems are known:
-
-* btrfs send:  there are bugs in `btrfs send` that can be triggered by
-  bees on old kernels.  The [`--workaround-btrfs-send` option](options.md)
-  works around this issue by preventing bees from modifying read-only
-  snapshots.
-
-* btrfs qgroups:  very slow, sometimes hangs...and it's even worse when
-  bees is running.
-
-* btrfs autodefrag mount option:  bees cannot distinguish autodefrag
-  activity from normal filesystem activity, and may try to undo the
-  autodefrag if duplicate copies of the defragmented data exist.
+**Note:** some btrfs features have minimum kernel versions which are
+higher than the minimum kernel version for bees.

 Untested Btrfs Feature Interactions
 -----------------------------------
@@ -45,10 +39,6 @@ bees has not been tested with the following, and undesirable interactions may oc

 * Non-4K filesystem data block size (should work if recompiled)
 * Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
-* btrfs seed filesystems (no particular reason it wouldn't work, but no one has reported trying)
-* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe, encryption, extent tree v2)
-* btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
-* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
+* btrfs seed filesystems, raid-stripe-tree, squotas (no particular reason these wouldn't work, but no one has reported trying)
+* btrfs out-of-tree kernel patches (e.g. encryption, extent tree v2)
 * Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
-* bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
-* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper
--- a/docs/config.md
+++ b/docs/config.md
@@ -26,11 +26,7 @@ Here are some numbers to estimate appropriate hash table sizes:
 Notes:

 * If the hash table is too large, no extra dedupe efficiency is
-obtained, and the extra space wastes RAM.  If the hash table contains
-more block records than there are blocks in the filesystem, the extra
-space can slow bees down.  A table that is too large prevents obsolete
-data from being evicted, so bees wastes time looking for matching data
-that is no longer present on the filesystem.
+obtained, and the extra space wastes RAM.

 * If the hash table is too small, bees extrapolates from matching
 blocks to find matching adjacent blocks in the filesystem that have been
@@ -59,19 +55,19 @@ patterns on dedupe effectiveness without performing deep inspection of
 both the filesystem data and its structure--a task that is as expensive
 as performing the deduplication.

-* **Compression** on the filesystem reduces the average extent length
-compared to uncompressed filesystems.  The maximum compressed extent
-length on btrfs is 128KB, while the maximum uncompressed extent length
-is 128MB.  Longer extents decrease the optimum hash table size while
-shorter extents increase the optimum hash table size because the
-probability of a hash table entry being present (i.e. unevicted) in
-each extent is proportional to the extent length.
+* **Compression** in files reduces the average extent length compared
+to uncompressed files.  The maximum compressed extent length on
+btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
+Longer extents decrease the optimum hash table size while shorter extents
+increase the optimum hash table size, because the probability of a hash
+table entry being present (i.e. unevicted) in each extent is proportional
+to the extent length.

   As a rule of thumb, the optimal hash table size for a compressed
 filesystem is 2-4x larger than the optimal hash table size for the same
-data on an uncompressed filesystem.  Dedupe efficiency falls dramatically
-with hash tables smaller than 128MB/TB as the average dedupe extent size
-is larger than the largest possible compressed extent size (128KB).
+data on an uncompressed filesystem.  Dedupe efficiency falls rapidly with
+hash tables smaller than 128MB/TB as the average dedupe extent size is
+larger than the largest possible compressed extent size (128KB).

 * **Short writes or fragmentation** also shorten the average extent
 length and increase optimum hash table size.  If a database writes to
@@ -115,7 +111,6 @@ Extent scan mode:
 * Works with 4.15 and later kernels.
 * Can estimate progress and provide an ETA.
 * Can optimize scanning order to dedupe large extents first.
- * Cannot avoid modifying read-only subvols.
 * Can keep up with frequent creation and deletion of snapshots.

 Subvol scan modes:
@@ -123,8 +118,7 @@ Subvol scan modes:
 * Work with 4.14 and earlier kernels.
 * Cannot estimate or report progress.
 * Cannot optimize scanning order by extent size.
- * Can avoid modifying read-only subvols (for `btrfs send` workaround).
- * Have problems keeping up with snapshots created during a scan.
+ * Have problems keeping up with multiple snapshots created during a scan.

 The default scan mode is 4, "extent".

@@ -212,7 +206,7 @@ Extent scan mode
 Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
 Extent scan mode reads each extent once, regardless of the number of
 reflinks or snapshots.  It adapts to the creation of new snapshots
-immediately, without having to revisit old data.
+and reflinks immediately, without having to revisit old data.

 In the extent scan mode, extents are separated into multiple size tiers
 to prioritize large extents over small ones.  Deduping large extents
@@ -268,17 +262,54 @@ send` in extent scan mode, and restart bees after the `send` is complete.
 Threads and load management
 ---------------------------

-By default, bees creates one worker thread for each CPU detected.
-These threads then perform scanning and dedupe operations.  The number of
-worker threads can be set with the [`--thread-count` and `--thread-factor`
-options](options.md).
+By default, bees creates one worker thread for each CPU detected.  These
+threads then perform scanning and dedupe operations.  bees attempts to
+maximize the amount of productive work each thread does, until either the
+threads are all continuously busy, or there is no remaining work to do.

-If desired, bees can automatically increase or decrease the number
-of worker threads in response to system load.  This reduces impact on
-the rest of the system by pausing bees when other CPU and IO intensive
-loads are active on the system, and resumes bees when the other loads
-are inactive.  This is configured with the [`--loadavg-target` and
-`--thread-min` options](options.md).
+In many cases it is not desirable to continually run bees at maximum
+performance.  Maximum performance is not necessary if bees can dedupe
+new data faster than it appears on the filesystem.  If it only takes
+bees 10 minutes per day to dedupe all new data on a filesystem, then
+bees doesn't need to run for more than 10 minutes per day.
+
+bees supports a number of options for reducing system load:
+
+ * Run bees for a few hours per day, at an off-peak time (i.e. during
+ a maintenance window), instead of running bees continuously.  Any data
+ added to the filesystem while bees is not running will be scanned when
+ bees restarts.  At the end of the maintenance window, terminate the
+ bees process with SIGTERM to write the hash table and scan position
+ for the next maintenance window.
+
+ * Temporarily pause bees operation by sending the bees process SIGUSR1,
+ and resume operation with SIGUSR2.  This is preferable to freezing
+ and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
+ signals, because it allows bees to close open file handles that would
+ otherwise prevent those files from being deleted while bees is frozen.
+
+ * Reduce the number of worker threads with the [`--thread-count` or
+`--thread-factor` options](options.md).  This simply leaves CPU cores
+ idle so that other applications on the host can use them, or to save
+ power.
+
+ * Allow bees to automatically track system load and increase or decrease
+ the number of threads to reach a target system load.  This reduces
+ impact on the rest of the system by pausing bees when other CPU and IO
+ intensive loads are active on the system, and resumes bees when the other
+ loads are inactive.  This is configured with the [`--loadavg-target`
+ and `--thread-min` options](options.md).
+
+ * Allow bees to self-throttle operations that enqueue delayed work
+ within btrfs.  These operations are not well controlled by Linux
+ features such as process priority or IO priority or IO rate-limiting,
+ because the enqueued work is submitted to btrfs several seconds before
+ btrfs performs the work.  By the time btrfs performs the work, it's too
+ late for external throttling to be effective.  The [`--throttle-factor`
+ option](options.md) tracks how long it takes btrfs to complete queued
+ operations, and reduces bees's queued work submission rate to match
+ btrfs's queued work completion rate (or a fraction thereof, to reduce
+ system load).

 Log verbosity
 -------------
--- a/docs/event-counters.md
+++ b/docs/event-counters.md
@@ -120,13 +120,14 @@ The `crawl` event group consists of operations related to scanning btrfs trees t

 * `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
 * `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
- * `crawl_create`: A new subvol or extent crawler was created.
 * `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
 * `crawl_done`: One pass over a subvol was completed.
- * `crawl_discard`: An extent that didn't match the crawler's size tier was discarded.
+ * `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
+ * `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
 * `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
 * `crawl_extent`: The extent crawler queued all references to an extent for processing.
 * `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
+ * `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
 * `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
 * `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
 * `crawl_hole`: An extent item in the search results refers to a hole.
@@ -138,6 +139,8 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
 * `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
 * `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
 * `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
+ * `crawl_skip`: Small extent items were skipped because no extent of sufficient size was found within the minimum search distance.
+ * `crawl_skip_ms`: Time spent skipping small extent items.
 * `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
 * `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
 * `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
@@ -281,11 +284,14 @@ The `progress` event group consists of events related to progress estimation.
 readahead
 ---------

-The `readahead` event group consists of events related to calls to `posix_fadvise`.
+The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).

+ * `readahead_bytes`: Number of bytes prefetched.
+ * `readahead_count`: Number of read calls.
 * `readahead_clear`: Number of times the duplicate read cache was cleared.
- * `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
+ * `readahead_fail`: Number of read errors during prefetch.
 * `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
+ * `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
 * `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.

 replacedst
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------

-bees is a block-oriented userspace deduplication agent designed for large
-btrfs filesystems.  It is an offline dedupe combined with an incremental
-data scan capability to minimize time data spends on disk from write
-to dedupe.
+bees is a block-oriented userspace deduplication agent designed to scale
+up to large btrfs filesystems.  It is an offline dedupe combined with
+an incremental data scan capability to minimize time data spends on disk
+from write to dedupe.

 Strengths
 ---------

- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Daemon mode - incrementally dedupes new data as it appears
+ * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
- * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
+ * btrfs support - recovers more free space from btrfs than naive dedupers

 Weaknesses
 ----------

 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
- * First run may require temporary disk space for extent reorganization
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * [First run may increase metadata space usage if many snapshots exist](gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------

 * [bees Gotchas](gotchas.md)
- * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](btrfs-other.md)
 * [What to do when something goes wrong](wrong.md)

@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
--- a/docs/options.md
+++ b/docs/options.md
@@ -84,19 +84,22 @@

 * `--workaround-btrfs-send` or `-a`

+ _This option is obsolete and should not be used any more._
+
 Pretend that read-only snapshots are empty and silently discard any
-request to dedupe files referenced through them.  This is a workaround for
-[problems with the kernel implementation of `btrfs send` and `btrfs send
+request to dedupe files referenced through them.  This is a workaround
+for [problems with old kernels running `btrfs send` and `btrfs send
 -p`](btrfs-kernel.md) which make these btrfs features unusable with bees.

- This option should be used to avoid breaking `btrfs send` on the same
-filesystem.
+ This option was used to avoid breaking `btrfs send` on old kernels.
+ The affected kernels are now too old to be recommended for use with bees.
+
+ bees now waits for `btrfs send` to finish.  There is no need for an
+ option to enable this.

 **Note:** There is a _significant_ space tradeoff when using this option:
 it is likely no space will be recovered--and possibly significant extra
-space used--until the read-only snapshots are deleted.  On the other
-hand, if snapshots are rotated frequently then bees will spend less time
-scanning them.
+space used--until the read-only snapshots are deleted.

 ## Logging options

--- a/docs/wrong.md
+++ b/docs/wrong.md
@@ -4,16 +4,13 @@ What to do when something goes wrong with bees
 Hangs and excessive slowness
 ----------------------------

-### Are you using qgroups or autodefrag?
-
-  Read about [bad btrfs feature interactions](btrfs-other.md).
-
 ### Use load-throttling options

  If bees is just more aggressive than you would like, consider using
  [load throttling options](options.md).  These are usually more effective
  than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
-  certainly use those too).
+  certainly use those too) because they limit work that bees queues up
+  for later execution inside btrfs.

 ### Check `$BEESSTATUS`

@@ -52,10 +49,6 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li

 Thread names of note:

- * `crawl_12345`: scan/dedupe worker threads (the number is the subvol
-   ID which the thread is currently working on).  These threads appear
-   and disappear from the status dynamically according to the requirements
-   of the work queue and loadavg throttling.
 * `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
 * `crawl_master`: task that finds new extents in the filesystem and populates the work queue
 * `crawl_transid`: btrfs transid (generation number) tracker and polling thread
@@ -64,6 +57,13 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
 * `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
 * `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly

+Most other threads have names that are derived from the current dedupe
+task that they are executing:
+
+ * `ref_205ad76b1000_24K_50`: extent scan performing dedupe of btrfs extent bytenr `205ad76b1000`, which is 24 KiB long and has 50 references
+ * `extent_250_32M_16E`: extent scan searching for extents between 32 MiB + 1 byte and 16 EiB long, tracking scan position in virtual subvol `250`
+ * `crawl_378_18916`: subvol scan searching for extent refs in subvol `378`, inode `18916`
+
 ### Dump kernel stacks of hung processes

 Check the kernel stacks of all blocked kernel processes:
@@ -91,7 +91,7 @@ bees Crashes
        (gdb) thread apply all bt full

  The last line generates megabytes of output and will often crash gdb.
-  This is OK, submit whatever output gdb can produce.
+  Submit whatever output gdb can produce.

  **Note that this output may include filenames or data from your
  filesystem.**
@@ -160,8 +160,7 @@ Kernel crashes, corruption, and filesystem damage
 -------------------------------------------------

 bees doesn't do anything that _should_ cause corruption or data loss;
-however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
-with some Linux block device layers](btrfs-other.md), so corruption is
+however, [btrfs has kernel bugs](btrfs-kernel.md), so corruption is
 not impossible.

 Issues with the btrfs filesystem kernel code or other block device layers
--- a/include/crucible/btrfs-tree.h
+++ b/include/crucible/btrfs-tree.h
@@ -173,34 +173,42 @@ namespace crucible {
 		void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
 	};

-	/// Fetch extent items from extent tree
+	/// Fetch extent items from extent tree.
+	/// Does not filter out metadata!  See BtrfsDataExtentTreeFetcher for that.
 	class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsExtentItemFetcher(const Fd &fd);
 	};

-	/// Fetch extent refs from an inode
+	/// Fetch extent refs from an inode.  Caller must set the tree and objectid.
 	class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
 	public:
 		BtrfsExtentDataFetcher(const Fd &fd);
 	};

-	/// Fetch inodes from a subvol
-	class BtrfsFsTreeFetcher : public BtrfsTreeObjectFetcher {
-	public:
-		BtrfsFsTreeFetcher(const Fd &fd, uint64_t subvol);
-	};
-
+	/// Fetch raw inode items
 	class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsInodeFetcher(const Fd &fd);
 		BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
 	};

+	/// Fetch a root (subvol) item
 	class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
 	public:
 		BtrfsRootFetcher(const Fd &fd);
 		BtrfsTreeItem root(uint64_t subvol);
+		BtrfsTreeItem root_backref(uint64_t subvol);
+	};
+
+	/// Fetch data extent items from extent tree, skipping metadata-only block groups
+	class BtrfsDataExtentTreeFetcher : public BtrfsExtentItemFetcher {
+		BtrfsTreeItem		m_current_bg;
+		BtrfsTreeOffsetFetcher	m_chunk_tree;
+	protected:
+		virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr) override;
+	public:
+		BtrfsDataExtentTreeFetcher(const Fd &fd);
 	};

 }
--- a/include/crucible/btrfs.h
+++ b/include/crucible/btrfs.h
@@ -78,9 +78,6 @@ enum btrfs_compression_type {
 	#define BTRFS_SHARED_BLOCK_REF_KEY      182
 	#define BTRFS_SHARED_DATA_REF_KEY       184
 	#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
-	#define BTRFS_FREE_SPACE_INFO_KEY 198
-	#define BTRFS_FREE_SPACE_EXTENT_KEY 199
-	#define BTRFS_FREE_SPACE_BITMAP_KEY 200
 	#define BTRFS_DEV_EXTENT_KEY    204
 	#define BTRFS_DEV_ITEM_KEY      216
 	#define BTRFS_CHUNK_ITEM_KEY    228
@@ -97,6 +94,18 @@ enum btrfs_compression_type {

 #endif

+#ifndef BTRFS_FREE_SPACE_INFO_KEY
+	#define BTRFS_FREE_SPACE_INFO_KEY 198
+	#define BTRFS_FREE_SPACE_EXTENT_KEY 199
+	#define BTRFS_FREE_SPACE_BITMAP_KEY 200
+	#define BTRFS_FREE_SPACE_OBJECTID -11ULL
+#endif
+
+#ifndef BTRFS_BLOCK_GROUP_RAID1C4
+	#define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
+	#define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
+#endif
+
 #ifndef BTRFS_DEFRAG_RANGE_START_IO

 	// For some reason uapi has BTRFS_DEFRAG_RANGE_COMPRESS and
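
Reviewer note on the `BTRFS_FREE_SPACE_OBJECTID -11ULL` fallback above: the negative-looking literal is an unsigned value near the top of the 64-bit objectid space. The sketch below only illustrates that arithmetic; the function name `sketch_free_space_objectid` is hypothetical and not part of crucible.

```cpp
#include <cstdint>

// Same spelling as the fallback in the diff; guarded so that a definition
// from a real uapi header would take precedence.
#ifndef BTRFS_FREE_SPACE_OBJECTID
#define BTRFS_FREE_SPACE_OBJECTID -11ULL
#endif

// Unary minus on an unsigned operand wraps modulo 2^64,
// so -11ULL is 2^64 - 11, not a negative number.
static_assert(BTRFS_FREE_SPACE_OBJECTID == 0xFFFFFFFFFFFFFFF5ULL,
	"-11ULL wraps to 2^64 - 11");

// Hypothetical helper: return the objectid as a plain 64-bit value.
std::uint64_t sketch_free_space_objectid()
{
	return BTRFS_FREE_SPACE_OBJECTID;
}
```

Because the macro body is not parenthesized, it is safest to use it only in simple comparisons and assignments like the ones shown here.
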
--- a/include/crucible/fs.h
+++ b/include/crucible/fs.h
@@ -201,11 +201,13 @@ namespace crucible {
 		static thread_local size_t s_calls;
 		static thread_local size_t s_loops;
 		static thread_local size_t s_loops_empty;
+		static thread_local shared_ptr<ostream> s_debug_ostream;
 	};

 	ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
 	ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);

+	string btrfs_chunk_type_ntoa(uint64_t type);
 	string btrfs_search_type_ntoa(unsigned type);
 	string btrfs_search_objectid_ntoa(uint64_t objectid);
 	string btrfs_compress_type_ntoa(uint8_t type);
@@ -246,9 +248,11 @@ namespace crucible {
 	struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args_v3 {
 		BtrfsIoctlFsInfoArgs();
 		void do_ioctl(int fd);
+		bool do_ioctl_nothrow(int fd);
 		uint16_t csum_type() const;
 		uint16_t csum_size() const;
 		uint64_t generation() const;
+		vector<uint8_t> fsid() const;
 	};

 	ostream & operator<<(ostream &os, const BtrfsIoctlFsInfoArgs &a);
--- a/include/crucible/hexdump.h
+++ b/include/crucible/hexdump.h
@@ -13,7 +13,7 @@ namespace crucible {
 	hexdump(ostream &os, const V &v)
 	{
 		const auto v_size = v.size();
-		const uint8_t* const v_data = reinterpret_cast<uint8_t*>(v.data());
+		const uint8_t* const v_data = reinterpret_cast<const uint8_t*>(v.data());
 		os << "V { size = " << v_size << ", data:\n";
 		for (size_t i = 0; i < v_size; i += 8) {
 			string hex, ascii;
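
Reviewer note on the `hexdump` change above: `v` is a const reference, so `v.data()` yields a pointer to const elements, and `reinterpret_cast` is not allowed to drop that qualifier. A minimal sketch of the same pattern, using a hypothetical helper `first_byte` that is not part of crucible:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: view a contiguous container's storage as raw bytes
// and return the first one (or 0 if the container is empty).
template <class V>
std::uint8_t first_byte(const V &v)
{
	// v is const, so v.data() returns a pointer to const elements.
	// reinterpret_cast<std::uint8_t *>(v.data()) would be ill-formed here
	// because it casts away const; the target type must be const as well.
	const std::uint8_t *const bytes =
		reinterpret_cast<const std::uint8_t *>(v.data());
	return v.size() ? bytes[0] : 0;
}
```

The helper accepts any container with `data()` and `size()`, for example `std::string` or `std::vector<char>`.
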
--- a/include/crucible/lockset.h
+++ b/include/crucible/lockset.h
@@ -117,7 +117,7 @@ namespace crucible {
 		while (full() || locked(name)) {
 			m_condvar.wait(lock);
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK0(runtime_error, rv.second);
 	}

@@ -129,7 +129,7 @@ namespace crucible {
 		if (full() || locked(name)) {
 			return false;
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK1(runtime_error, name, rv.second);
 		return true;
 	}
--- a/include/crucible/openat2.h
+++ b/include/crucible/openat2.h
@@ -0,0 +1,52 @@
+#ifndef CRUCIBLE_OPENAT2_H
+#define CRUCIBLE_OPENAT2_H
+
+#include <cstdlib>
+
+// Compatibility for building on old libc for new kernel
+#include <linux/version.h>
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
+
+#include <linux/openat2.h>
+
+#else
+
+#include <linux/types.h>
+
+#ifndef RESOLVE_NO_XDEV
+#define RESOLVE_NO_XDEV 1
+
+// RESOLVE_NO_XDEV was there from the beginning of openat2,
+// so if that's missing, so is open_how
+
+struct open_how {
+	__u64 flags;
+	__u64 mode;
+	__u64 resolve;
+};
+#endif
+
+#ifndef RESOLVE_NO_MAGICLINKS
+#define RESOLVE_NO_MAGICLINKS 2
+#endif
+#ifndef RESOLVE_NO_SYMLINKS
+#define RESOLVE_NO_SYMLINKS 4
+#endif
+#ifndef RESOLVE_BENEATH
+#define RESOLVE_BENEATH 8
+#endif
+#ifndef RESOLVE_IN_ROOT
+#define RESOLVE_IN_ROOT 16
+#endif
+
+#endif // Linux version >= v5.6
+
+extern "C" {
+
+/// Weak symbol to support libc with no syscall wrapper
+int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size) throw();
+
+};
+
+#endif // CRUCIBLE_OPENAT2_H
--- a/include/crucible/process.h
+++ b/include/crucible/process.h
@@ -10,6 +10,10 @@
 #include <sys/wait.h>
 #include <unistd.h>

+extern "C" {
+	pid_t gettid() throw();
+};
+
 namespace crucible {
 	using namespace std;

@@ -73,7 +77,6 @@ namespace crucible {

 	typedef ResourceHandle<Process::id, Process> Pid;

-	pid_t gettid();
 	double getloadavg1();
 	double getloadavg5();
 	double getloadavg15();
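
Reviewer note on the `gettid` change above: the declaration moves out of `namespace crucible` to a global `extern "C"` prototype, which is why the call sites in this diff drop the `crucible::` qualifier. The definition is not shown in this section. One way such a function can be provided on a libc that lacks the wrapper is a raw syscall; the sketch below uses the hypothetical name `sketch_gettid` so that it cannot clash with a real `gettid`.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for a libc without a gettid() wrapper.
// SYS_gettid takes no arguments and returns the calling thread's ID.
pid_t sketch_gettid()
{
	return static_cast<pid_t>(syscall(SYS_gettid));
}
```

On Linux the value is always positive, and in the initial thread of a process it equals `getpid()`.
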
--- a/include/crucible/seeker.h
+++ b/include/crucible/seeker.h
@@ -6,23 +6,23 @@
 #include <algorithm>
 #include <limits>

-#include <cstdint>
-
-#if 1
+// Debug stream
+#include <memory>
 #include <iostream>
 #include <sstream>
-#define DINIT(__x) __x
-#define DLOG(__x) do { logs << __x << std::endl; } while (false)
-#define DOUT(__err) do { __err << logs.str(); } while (false)
-#else
-#define DINIT(__x) do {} while (false)
-#define DLOG(__x) do {} while (false)
-#define DOUT(__x) do {} while (false)
-#endif
+
+#include <cstdint>

 namespace crucible {
 	using namespace std;

+	extern thread_local shared_ptr<ostream> tl_seeker_debug_str;
+	#define SEEKER_DEBUG_LOG(__x) do { \
+		if (tl_seeker_debug_str) { \
+			(*tl_seeker_debug_str) << __x << "\n"; \
+		} \
+	} while (false)
+
 	// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
 	// - fetches objects in Pos order, starting from lower (must be >= lower)
 	// - must return upper if present, may or may not return objects after that
@@ -49,113 +49,108 @@ namespace crucible {
 	Pos
 	seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
 	{
-		DINIT(ostringstream logs);
-		try {
-			static const Pos end_pos = numeric_limits<Pos>::max();
-			// TBH this probably won't work if begin_pos != 0, i.e. any signed type
-			static const Pos begin_pos = numeric_limits<Pos>::min();
-			// Run a binary search looking for the highest key below target_pos.
-			// Initial upper bound of the search is target_pos.
-			// Find initial lower bound by doubling the size of the range until a key below target_pos
-			// is found, or the lower bound reaches the beginning of the search space.
-			// If the lower bound search reaches the beginning of the search space without finding a key,
-			// return the beginning of the search space; otherwise, perform a binary search between
-			// the bounds now established.
-			Pos lower_bound = 0;
-			Pos upper_bound = target_pos;
-			bool found_low = false;
-			Pos probe_pos = target_pos;
-			// We need one loop for each bit of the search space to find the lower bound,
-			// one loop for each bit of the search space to find the upper bound,
-			// and one extra loop to confirm the boundary is correct.
-			for (size_t loop_count = min(numeric_limits<Pos>::digits * size_t(2) + 1, max_loops); loop_count; --loop_count) {
-				DLOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
-				auto result = fetch(probe_pos, target_pos);
-				const Pos low_pos = result.empty() ? end_pos : *result.begin();
-				const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
-				DLOG(" = " << low_pos << ".." << high_pos);
-				// check for correct behavior of the fetch function
-				THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
-				THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
-				THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
-				if (!found_low) {
-					// if target_pos == end_pos then we will find it in every empty result set,
-					// so in that case we force the lower bound to be lower than end_pos
-					if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
-						// found a lower bound, set the low bound there and switch to binary search
-						found_low = true;
-						lower_bound = low_pos;
-						DLOG("found_low = true, lower_bound = " << lower_bound);
-					} else {
-						// still looking for lower bound
-						// if probe_pos was begin_pos then we can stop with no result
-						if (probe_pos == begin_pos) {
-							DLOG("return: probe_pos == begin_pos " << begin_pos);
-							return begin_pos;
-						}
-						// double the range size, or use the distance between objects found so far
-						THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
-						// already checked low_pos <= high_pos above
-						const Pos want_delta = max(upper_bound - probe_pos, min_step);
-						// avoid underflowing the beginning of the search space
-						const Pos have_delta = min(want_delta, probe_pos - begin_pos);
-						THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
-						// move probe and try again
-						probe_pos = probe_pos - have_delta;
-						DLOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
-						continue;
+		static const Pos end_pos = numeric_limits<Pos>::max();
+		// TBH this probably won't work if begin_pos != 0, i.e. any signed type
+		static const Pos begin_pos = numeric_limits<Pos>::min();
+		// Run a binary search looking for the highest key at or below target_pos.
+		// Initial upper bound of the search is target_pos.
+		// Find initial lower bound by doubling the size of the range until a key below target_pos
+		// is found, or the lower bound reaches the beginning of the search space.
+		// If the lower bound search reaches the beginning of the search space without finding a key,
+		// return the beginning of the search space; otherwise, perform a binary search between
+		// the bounds now established.
+		Pos lower_bound = 0;
+		Pos upper_bound = target_pos;
+		bool found_low = false;
+		Pos probe_pos = target_pos;
+		// We need one loop for each bit of the search space to find the lower bound,
+		// one loop for each bit of the search space to find the upper bound,
+		// and one extra loop to confirm the boundary is correct.
+		for (size_t loop_count = min((1 + numeric_limits<Pos>::digits) * size_t(2), max_loops); loop_count; --loop_count) {
+			SEEKER_DEBUG_LOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
+			auto result = fetch(probe_pos, target_pos);
+			const Pos low_pos = result.empty() ? end_pos : *result.begin();
+			const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
+			SEEKER_DEBUG_LOG(" = " << low_pos << ".." << high_pos);
+			// check for correct behavior of the fetch function
+			THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
+			THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
+			THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
+			if (!found_low) {
+				// if target_pos == end_pos then we will find it in every empty result set,
+				// so in that case we force the lower bound to be lower than end_pos
+				if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
+					// found a lower bound, set the low bound there and switch to binary search
+					found_low = true;
+					lower_bound = low_pos;
+					SEEKER_DEBUG_LOG("found_low = true, lower_bound = " << lower_bound);
+				} else {
+					// still looking for lower bound
+					// if probe_pos was begin_pos then we can stop with no result
+					if (probe_pos == begin_pos) {
+						SEEKER_DEBUG_LOG("return: probe_pos == begin_pos " << begin_pos);
+						return begin_pos;
 					}
+					// double the range size, or use the distance between objects found so far
+					THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+					// already checked low_pos <= high_pos above
+					const Pos want_delta = max(upper_bound - probe_pos, min_step);
+					// avoid underflowing the beginning of the search space
+					const Pos have_delta = min(want_delta, probe_pos - begin_pos);
+					THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
+					// move probe and try again
+					probe_pos = probe_pos - have_delta;
+					SEEKER_DEBUG_LOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
+					continue;
 				}
-				if (low_pos <= target_pos && target_pos <= high_pos) {
-					// have keys on either side of target_pos in result
-					// search from the high end until we find the highest key below target
-					for (auto i = result.rbegin(); i != result.rend(); ++i) {
-						// more correctness checking for fetch
-						THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
-						if (*i <= target_pos) {
-							DLOG("return: *i " << *i << " <= target_pos " << target_pos);
-							return *i;
-						}
-					}
-					// if the list is empty then low_pos = high_pos = end_pos
-					// if target_pos = end_pos also, then we will execute the loop
-					// above but not find any matching entries.
-					THROW_CHECK0(runtime_error, result.empty());
-				}
-				if (target_pos <= low_pos) {
-					// results are all too high, so probe_pos..low_pos is too high
-					// lower the high bound to the probe pos
-					upper_bound = probe_pos;
-					DLOG("upper_bound = probe_pos " << probe_pos);
-				}
-				if (high_pos < target_pos) {
-					// results are all too low, so probe_pos..high_pos is too low
-					// raise the low bound to the high_pos
-					DLOG("lower_bound = high_pos " << high_pos);
-					lower_bound = high_pos;
-				}
-				// compute a new probe pos at the middle of the range and try again
-				// we can't have a zero-size range here because we would not have set found_low yet
-				THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
-				const Pos delta = (upper_bound - lower_bound) / 2;
-				probe_pos = lower_bound + delta;
-				if (delta < 1) {
-					// nothing can exist in the range (lower_bound, upper_bound)
-					// and an object is known to exist at lower_bound
-					DLOG("return: probe_pos == lower_bound " << lower_bound);
-					return lower_bound;
-				}
-				THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
-				THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
-				DLOG("loop: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
 			}
-			THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
-				"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
-				"found_low " << found_low);
-		} catch (...) {
-			DOUT(cerr);
-			throw;
+			if (low_pos <= target_pos && target_pos <= high_pos) {
+				// have keys on either side of target_pos in result
+				// search from the high end until we find the highest key at or below target
+				for (auto i = result.rbegin(); i != result.rend(); ++i) {
+					// more correctness checking for fetch
+					THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
+					if (*i <= target_pos) {
+						SEEKER_DEBUG_LOG("return: *i " << *i << " <= target_pos " << target_pos);
+						return *i;
+					}
+				}
+				// if the list is empty then low_pos = high_pos = end_pos
+				// if target_pos = end_pos also, then we will execute the loop
+				// above but not find any matching entries.
+				THROW_CHECK0(runtime_error, result.empty());
+			}
+			if (target_pos <= low_pos) {
+				// results are all too high, so probe_pos..low_pos is too high
+				// lower the high bound to the probe pos, low_pos cannot be lower
+				SEEKER_DEBUG_LOG("upper_bound = probe_pos " << probe_pos);
+				upper_bound = probe_pos;
+			}
+			if (high_pos < target_pos) {
+				// results are all too low, so probe_pos..high_pos is too low
+				// raise the low bound to high_pos but not above upper_bound
+				const auto next_pos = min(high_pos, upper_bound);
+				SEEKER_DEBUG_LOG("lower_bound = next_pos " << next_pos);
+				lower_bound = next_pos;
+			}
+			// compute a new probe pos at the middle of the range and try again
+			// we can't have a zero-size range here because we would not have set found_low yet
+			THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
+			const Pos delta = (upper_bound - lower_bound) / 2;
+			probe_pos = lower_bound + delta;
+			if (delta < 1) {
+				// nothing can exist in the range (lower_bound, upper_bound)
+				// and an object is known to exist at lower_bound
+				SEEKER_DEBUG_LOG("return: probe_pos == lower_bound " << lower_bound);
+				return lower_bound;
+			}
+			THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
+			THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+			SEEKER_DEBUG_LOG("loop bottom: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
 		}
+		THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
+			"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
+			"found_low " << found_low);
 	}
 }
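
Reviewer note on the `seeker.h` change above: the old code built an `ostringstream` on every call and dumped it to `cerr` when an exception escaped, while the new `SEEKER_DEBUG_LOG` writes only if a thread-local stream pointer has been set. The sketch below shows how a caller might switch such logging on and off. All `sketch_`/`SKETCH_` names are stand-ins for illustration, not the crucible symbols.

```cpp
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for tl_seeker_debug_str: null means logging is disabled.
thread_local std::shared_ptr<std::ostream> sketch_debug_str;

// Stand-in for SEEKER_DEBUG_LOG: the stream expression is evaluated
// only when a debug stream is installed on the current thread.
#define SKETCH_DEBUG_LOG(x) do { \
	if (sketch_debug_str) { \
		(*sketch_debug_str) << x << "\n"; \
	} \
} while (false)

// Log `steps` messages into a private buffer and return the captured text.
std::string run_with_capture(int steps)
{
	auto capture = std::make_shared<std::ostringstream>();
	sketch_debug_str = capture;        // enable logging on this thread only
	for (int i = 0; i < steps; ++i) {
		SKETCH_DEBUG_LOG("probe_pos = " << i);
	}
	sketch_debug_str.reset();          // disable logging again
	SKETCH_DEBUG_LOG("not recorded");  // no-op: the pointer is null
	return capture->str();
}
```

Since the pointer is `thread_local`, enabling it in one thread does not affect log output or cost in any other thread.
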

--- a/include/crucible/task.h
+++ b/include/crucible/task.h
@@ -47,6 +47,10 @@ namespace crucible {
 		/// been destroyed.
 		void append(const Task &task) const;

+		/// Schedule Task to run after this Task has run or
+		/// been destroyed, in Task ID order.
+		void insert(const Task &task) const;
+
 		/// Describe Task as text.
 		string title() const;

@@ -172,9 +176,6 @@ namespace crucible {
 		/// objects it holds, and exit its Task function.
 		ExclusionLock try_lock(const Task &task);

-		/// Execute Task when Exclusion is unlocked (possibly
-		/// immediately).
-		void insert_task(const Task &t);
 	};

 	/// Wrapper around pthread_setname_np which handles length limits
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -14,8 +14,10 @@ CRUCIBLE_OBJS = \
 	fs.o \
 	multilock.o \
 	ntoa.o \
+	openat2.o \
 	path.o \
 	process.o \
+	seeker.o \
 	string.o \
 	table.o \
 	task.o \
--- a/lib/btrfs-tree.cc
+++ b/lib/btrfs-tree.cc
@@ -5,6 +5,12 @@
 #include "crucible/hexdump.h"
 #include "crucible/seeker.h"

+#define CRUCIBLE_BTRFS_TREE_DEBUG(x) do { \
+	if (BtrfsIoctlSearchKey::s_debug_ostream) { \
+		(*BtrfsIoctlSearchKey::s_debug_ostream) << x; \
+	} \
+} while (false)
+
 namespace crucible {
 	using namespace std;

@@ -355,6 +361,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::at(uint64_t logical)
 	{
+		CRUCIBLE_BTRFS_TREE_DEBUG("at " << logical);
 		BtrfsIoctlSearchKey &sk = m_sk;
 		fill_sk(sk, logical);
 		// Exact match, should return 0 or 1 items
@@ -397,53 +404,59 @@ namespace crucible {
 	BtrfsTreeFetcher::rlower_bound(uint64_t logical)
 	{
 	#if 0
-	#define BTFRLB_DEBUG(x) do { cerr << x; } while (false)
+		static bool btfrlb_debug = getenv("BTFRLB_DEBUG");
+	#define BTFRLB_DEBUG(x) do { if (btfrlb_debug) cerr << x; } while (false)
 	#else
-	#define BTFRLB_DEBUG(x) do { } while (false)
+	#define BTFRLB_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
 	#endif
 		BtrfsTreeItem closest_item;
 		uint64_t closest_logical = 0;
 		BtrfsIoctlSearchKey &sk = m_sk;
 		size_t loops = 0;
-		BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << endl);
-		seek_backward(scale_logical(logical), [&](uint64_t lower_bound, uint64_t upper_bound) {
+		BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << " in tree " << tree() << endl);
+		seek_backward(scale_logical(logical), [&](uint64_t const lower_bound, uint64_t const upper_bound) {
 			++loops;
 			fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
 			set<uint64_t> rv;
+			bool too_far = false;
 			do {
 				sk.nr_items = 4;
 				sk.do_ioctl(fd());
 				BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
 				for (auto &i : sk.m_result) {
 					next_sk(sk, i);
-					const auto this_logical = hdr_logical(i);
-					const auto scaled_hdr_logical = scale_logical(this_logical);
-					BTFRLB_DEBUG(" " << to_hex(scaled_hdr_logical));
-					if (hdr_match(i)) {
-						if (this_logical <= logical && this_logical > closest_logical) {
-							closest_logical = this_logical;
-							closest_item = i;
-						}
-						BTFRLB_DEBUG("(match)");
-						rv.insert(scaled_hdr_logical);
-					}
-					if (scaled_hdr_logical > upper_bound || hdr_stop(i)) {
-						if (scaled_hdr_logical >= upper_bound) {
-							BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
-						}
-						if (hdr_stop(i)) {
-							rv.insert(numeric_limits<uint64_t>::max());
-							BTFRLB_DEBUG("(stop)");
-						}
+					// If hdr_stop or !hdr_match, don't inspect the item
+					if (hdr_stop(i)) {
+						too_far = true;
+						rv.insert(numeric_limits<uint64_t>::max());
+						BTFRLB_DEBUG("(stop)");
 						break;
-					} else {
-						BTFRLB_DEBUG("(cont'd)");
 					}
+					if (!hdr_match(i)) {
+						BTFRLB_DEBUG("(no match)");
+						continue;
+					}
+					const auto this_logical = hdr_logical(i);
+					BTFRLB_DEBUG(" " << to_hex(this_logical) << " " << i);
+					const auto scaled_hdr_logical = scale_logical(this_logical);
+					BTFRLB_DEBUG(" " << "(match)");
+					if (scaled_hdr_logical > upper_bound) {
+						too_far = true;
+						BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " > " << to_hex(upper_bound) << ")");
+						break;
+					}
+					if (this_logical <= logical && this_logical > closest_logical) {
+						closest_logical = this_logical;
+						closest_item = i;
+						BTFRLB_DEBUG("(closest)");
+					}
+					rv.insert(scaled_hdr_logical);
+					BTFRLB_DEBUG("(cont'd)");
 				}
 				BTFRLB_DEBUG(endl);
 				// We might get a search result that contains only non-matching items.
 				// Keep looping until we find any matching item or we run out of tree.
-			} while (rv.empty() && !sk.m_result.empty());
+			} while (!too_far && rv.empty() && !sk.m_result.empty());
 			return rv;
 		}, scale_logical(lookbehind_size()));
 		return closest_item;
@@ -474,6 +487,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::next(uint64_t logical)
 	{
+		CRUCIBLE_BTRFS_TREE_DEBUG("next " << logical);
 		const auto scaled_logical = scale_logical(logical);
 		if (scaled_logical + 1 > scaled_max_logical()) {
 			return BtrfsTreeItem();
@@ -484,6 +498,7 @@ namespace crucible {
 	BtrfsTreeItem
 	BtrfsTreeFetcher::prev(uint64_t logical)
 	{
+		CRUCIBLE_BTRFS_TREE_DEBUG("prev " << logical);
 		const auto scaled_logical = scale_logical(logical);
 		if (scaled_logical < 1) {
 			return BtrfsTreeItem();
@@ -568,9 +583,10 @@ namespace crucible {
 	BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
 	{
 	#if 0
-	#define BCTFGS_DEBUG(x) do { cerr << x; } while (false)
+		static bool bctfgs_debug = getenv("BCTFGS_DEBUG");
+	#define BCTFGS_DEBUG(x) do { if (bctfgs_debug) cerr << x; } while (false)
 	#else
-	#define BCTFGS_DEBUG(x) do { } while (false)
+	#define BCTFGS_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
 	#endif
 		const uint64_t logical_end = logical + count * block_size();
 		BtrfsTreeItem bti = rlower_bound(logical);
@@ -662,14 +678,6 @@ namespace crucible {
 		type(BTRFS_EXTENT_DATA_KEY);
 	}

-	BtrfsFsTreeFetcher::BtrfsFsTreeFetcher(const Fd &new_fd, uint64_t subvol) :
-		BtrfsTreeObjectFetcher(new_fd)
-	{
-		tree(subvol);
-		type(BTRFS_EXTENT_DATA_KEY);
-		scale_size(1);
-	}
-
 	BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
 		BtrfsTreeObjectFetcher(fd)
 	{
@@ -693,18 +701,86 @@ namespace crucible {
 		BtrfsTreeObjectFetcher(fd)
 	{
 		tree(BTRFS_ROOT_TREE_OBJECTID);
-		type(BTRFS_ROOT_ITEM_KEY);
 		scale_size(1);
 	}

 	BtrfsTreeItem
-	BtrfsRootFetcher::root(uint64_t subvol)
+	BtrfsRootFetcher::root(const uint64_t subvol)
 	{
+		const auto my_type = BTRFS_ROOT_ITEM_KEY;
+		type(my_type);
 		const auto item = at(subvol);
 		if (!!item) {
 			THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
-			THROW_CHECK2(runtime_error, item.type(), BTRFS_ROOT_ITEM_KEY, item.type() == BTRFS_ROOT_ITEM_KEY);
+			THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
 		}
 		return item;
 	}
+
+	BtrfsTreeItem
+	BtrfsRootFetcher::root_backref(const uint64_t subvol)
+	{
+		const auto my_type = BTRFS_ROOT_BACKREF_KEY;
+		type(my_type);
+		const auto item = at(subvol);
+		if (!!item) {
+			THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
+			THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
+		}
+		return item;
+	}
+
+	BtrfsDataExtentTreeFetcher::BtrfsDataExtentTreeFetcher(const Fd &fd) :
+		BtrfsExtentItemFetcher(fd),
+		m_chunk_tree(fd)
+	{
+		tree(BTRFS_EXTENT_TREE_OBJECTID);
+		type(BTRFS_EXTENT_ITEM_KEY);
+		m_chunk_tree.tree(BTRFS_CHUNK_TREE_OBJECTID);
+		m_chunk_tree.type(BTRFS_CHUNK_ITEM_KEY);
+		m_chunk_tree.objectid(BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+	}
+
+	void
+	BtrfsDataExtentTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
+	{
+		key.min_type = key.max_type = type();
+		key.max_objectid = key.max_offset = numeric_limits<uint64_t>::max();
+		key.min_offset = 0;
+		key.min_objectid = hdr.objectid;
+		const auto step = scale_size();
+		if (key.min_objectid < numeric_limits<uint64_t>::max() - step) {
+			key.min_objectid += step;
+		} else {
+			key.min_objectid = numeric_limits<uint64_t>::max();
+		}
+		// If we have a current block group, check whether the new position is still inside it
+		if (!!m_current_bg) {
+			const auto bg_begin = m_current_bg.offset();
+			const auto bg_end = bg_begin + m_current_bg.chunk_length();
+			// If we are still in our current block group, return early
+			if (key.min_objectid >= bg_begin && key.min_objectid < bg_end) return;
+		}
+		// We don't have a current block group or we're out of range
+		// Find the chunk that this bytenr belongs to
+		m_current_bg = m_chunk_tree.rlower_bound(key.min_objectid);
+		// Make sure it's a data block group
+		while (!!m_current_bg) {
+			// Data block group, stop here
+			if (m_current_bg.chunk_type() & BTRFS_BLOCK_GROUP_DATA) break;
+			// Not a data block group, skip to end
+			key.min_objectid = m_current_bg.offset() + m_current_bg.chunk_length();
+			m_current_bg = m_chunk_tree.lower_bound(key.min_objectid);
+		}
+		if (!m_current_bg) {
+			// Ran out of data block groups, stop here
+			return;
+		}
+		// Check to see if bytenr is in the current data block group
+		const auto bg_begin = m_current_bg.offset();
+		if (key.min_objectid < bg_begin) {
+			// Move forward to start of data block group
+			key.min_objectid = bg_begin;
+		}
+	}
 }
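Review note on `BtrfsDataExtentTreeFetcher::next_sk` above: the skip-ahead logic can be modelled without any btrfs ioctls. The sketch below is only an illustration under stated assumptions: `Chunk`, `ChunkMap`, `NO_MORE_DATA` and `skip_to_data` are hypothetical names that do not exist in crucible, a `std::map` keyed by starting bytenr stands in for the chunk tree, and the block group cache (`m_current_bg`) is left out.

```cpp
#include <cstdint>
#include <limits>
#include <map>

// Hypothetical stand-in for a chunk item: its length and whether it holds data.
struct Chunk {
	uint64_t length;
	bool is_data;
};

// Chunks keyed by starting bytenr, assumed not to overlap.
using ChunkMap = std::map<uint64_t, Chunk>;

// Sentinel meaning "ran out of data chunks".
constexpr uint64_t NO_MORE_DATA = std::numeric_limits<uint64_t>::max();

// Return the smallest position >= bytenr that lies inside a data chunk,
// or NO_MORE_DATA if no such position exists.
uint64_t
skip_to_data(const ChunkMap &chunks, uint64_t bytenr)
{
	// Start from the last chunk that begins at or before bytenr, if any
	auto it = chunks.upper_bound(bytenr);
	if (it != chunks.begin()) --it;
	for (; it != chunks.end(); ++it) {
		const uint64_t begin = it->first;
		const uint64_t end = begin + it->second.length;
		// Skip chunks that end before bytenr or that hold no data
		if (end <= bytenr || !it->second.is_data) continue;
		// Already inside the chunk, or move forward to its start
		return bytenr < begin ? begin : bytenr;
	}
	return NO_MORE_DATA;
}
```

With metadata chunks at 0 and 200 and data chunks at 100 and 300 (each of length 100), a search position of 50 would move forward to 100, a position of 150 would stay where it is, and a position of 400 would report that no data chunks remain.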
--- a/lib/chatter.cc
+++ b/lib/chatter.cc
@@ -76,7 +76,7 @@ namespace crucible {
 			DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));

 			header_stream << buf;
-			header_stream << " " << getpid() << "." << crucible::gettid();
+			header_stream << " " << getpid() << "." << gettid();
 			if (add_prefix_level) {
 				header_stream << "<" << m_loglevel << ">";
 			}
@@ -88,7 +88,7 @@ namespace crucible {
 				header_stream << "<" << m_loglevel << ">";
 			}
 			header_stream << (m_name.empty() ? "thread" : m_name);
-			header_stream << "[" << crucible::gettid() << "]";
+			header_stream << "[" << gettid() << "]";
 		}

 		header_stream << ": ";
--- a/lib/fs.cc
+++ b/lib/fs.cc
@@ -757,6 +757,7 @@ namespace crucible {
 	thread_local size_t BtrfsIoctlSearchKey::s_calls = 0;
 	thread_local size_t BtrfsIoctlSearchKey::s_loops = 0;
 	thread_local size_t BtrfsIoctlSearchKey::s_loops_empty = 0;
+	thread_local shared_ptr<ostream> BtrfsIoctlSearchKey::s_debug_ostream;

 	bool
 	BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
@@ -776,6 +777,9 @@ namespace crucible {
 			ioctl_ptr = ioctl_arg.get<btrfs_ioctl_search_args_v2>();
 			ioctl_ptr->key = static_cast<const btrfs_ioctl_search_key&>(*this);
 			ioctl_ptr->buf_size = buf_size;
+			if (s_debug_ostream) {
+				(*s_debug_ostream) << "bisk " << (ioctl_ptr->key) << "\n";
+			}
 			// Don't bother supporting V1.  Kernels that old have other problems.
 			int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_arg.data());
 			++s_calls;
@@ -881,6 +885,26 @@ namespace crucible {
 		}
 	}

+	string
+	btrfs_chunk_type_ntoa(uint64_t type)
+	{
+		static const bits_ntoa_table table[] = {
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DATA),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_METADATA),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_SYSTEM),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DUP),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID0),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID10),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C3),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C4),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID5),
+			NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID6),
+			NTOA_TABLE_ENTRY_END()
+		};
+		return bits_ntoa(type, table);
+	}
+
 	string
 	btrfs_search_type_ntoa(unsigned type)
 	{
@@ -908,15 +932,9 @@ namespace crucible {
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_BLOCK_REF_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_DATA_REF_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_BLOCK_GROUP_ITEM_KEY),
-#ifdef BTRFS_FREE_SPACE_INFO_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_INFO_KEY),
-#endif
-#ifdef BTRFS_FREE_SPACE_EXTENT_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_EXTENT_KEY),
-#endif
-#ifdef BTRFS_FREE_SPACE_BITMAP_KEY
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_BITMAP_KEY),
-#endif
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_EXTENT_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_ITEM_KEY),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_CHUNK_ITEM_KEY),
@@ -948,9 +966,7 @@ namespace crucible {
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_CSUM_TREE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_QUOTA_TREE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_UUID_TREE_OBJECTID),
-#ifdef BTRFS_FREE_SPACE_TREE_OBJECTID
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_TREE_OBJECTID),
-#endif
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_BALANCE_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_ORPHAN_OBJECTID),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_TREE_LOG_OBJECTID),
@@ -1138,11 +1154,17 @@ namespace crucible {
 	{
 	}

-	void
-	BtrfsIoctlFsInfoArgs::do_ioctl(int fd)
+	bool
+	BtrfsIoctlFsInfoArgs::do_ioctl_nothrow(int const fd)
 	{
 		btrfs_ioctl_fs_info_args_v3 *p = static_cast<btrfs_ioctl_fs_info_args_v3 *>(this);
-		if (ioctl(fd, BTRFS_IOC_FS_INFO, p)) {
+		return 0 == ioctl(fd, BTRFS_IOC_FS_INFO, p);
+	}
+
+	void
+	BtrfsIoctlFsInfoArgs::do_ioctl(int const fd)
+	{
+		if (!do_ioctl_nothrow(fd)) {
 			THROW_ERRNO("BTRFS_IOC_FS_INFO: fd " << fd);
 		}
 	}
@@ -1159,6 +1181,13 @@ namespace crucible {
 		return this->btrfs_ioctl_fs_info_args_v3::csum_size;
 	}

+	vector<uint8_t>
+	BtrfsIoctlFsInfoArgs::fsid() const
+	{
+		const auto begin = btrfs_ioctl_fs_info_args_v3::fsid;
+		return vector<uint8_t>(begin, begin + BTRFS_FSID_SIZE);
+	}
+
 	uint64_t
 	BtrfsIoctlFsInfoArgs::generation() const
 	{
--- a/lib/openat2.cc
+++ b/lib/openat2.cc
@@ -0,0 +1,40 @@
+#include "crucible/openat2.h"
+
+#include <sys/syscall.h>
+
+// Compatibility for building on old libc for new kernel
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 6, 0)
+
+// Every arch that defines this uses 437, except Alpha, where 437 is
+// mq_getsetattr.
+
+#ifndef SYS_openat2
+#ifdef __alpha__
+#define SYS_openat2 547
+#else
+#define SYS_openat2 437
+#endif
+#endif
+
+#endif // Linux version < v5.6
+
+#include <fcntl.h>
+#include <unistd.h>
+
+extern "C" {
+
+int
+__attribute__((weak))
+openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
+throw()
+{
+#ifdef SYS_openat2
+	return syscall(SYS_openat2, dirfd, pathname, how, size);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+};
--- a/lib/process.cc
+++ b/lib/process.cc
@@ -7,13 +7,18 @@
 #include <cstdlib>
 #include <utility>

-// for gettid()
-#ifndef _GNU_SOURCE
-#define _GNU_SOURCE
-#endif
 #include <unistd.h>
 #include <sys/syscall.h>

+extern "C" {
+	pid_t
+	__attribute__((weak))
+	gettid() throw()
+	{
+		return syscall(SYS_gettid);
+	}
+};
+
 namespace crucible {
 	using namespace std;

@@ -111,12 +116,6 @@ namespace crucible {
 		}
 	}

-	pid_t
-	gettid()
-	{
-		return syscall(SYS_gettid);
-	}
-
 	double
 	getloadavg1()
 	{
--- a/lib/seeker.cc
+++ b/lib/seeker.cc
@@ -0,0 +1,7 @@
+#include "crucible/seeker.h"
+
+namespace crucible {
+
+	thread_local shared_ptr<ostream> tl_seeker_debug_str;
+
+};
--- a/lib/task.cc
+++ b/lib/task.cc
@@ -76,13 +76,24 @@ namespace crucible {
 		/// Tasks to be executed after the current task is executed
 		list<TaskStatePtr>			m_post_exec_queue;

-		/// Set by run() and append().  Cleared by exec().
+		/// Set by run(), append(), and insert().  Cleared by exec().
 		bool					m_run_now = false;

+		/// Set by insert().  Cleared by exec() and destructor.
+		bool					m_sort_queue = false;
+
 		/// Set when task starts execution by exec().
 		/// Cleared when exec() ends.
 		bool					m_is_running = false;

+		/// Set when task is queued while already running.
+		/// Cleared when task is requeued.
+		bool					m_run_again = false;
+
+		/// Set when task is queued as idle task while already running.
+		/// Cleared when task is queued as non-idle task.
+		bool					m_idle = false;
+
 		/// Sequential identifier for next task
 		static atomic<TaskId>			s_next_id;

@@ -107,7 +118,7 @@ namespace crucible {
 		static void clear_queue(TaskQueue &tq);

 		/// Rescue any TaskQueue, not just this one.
-		static void rescue_queue(TaskQueue &tq);
+		static void rescue_queue(TaskQueue &tq, const bool sort_queue);

 		TaskState &operator=(const TaskState &) = delete;
 		TaskState(const TaskState &) = delete;
@@ -142,6 +153,10 @@ namespace crucible {
 		/// or is destroyed.
 		void append(const TaskStatePtr &task);

+		/// Queue task to execute after current task finishes executing
+		/// or is destroyed, in task ID order.
+		void insert(const TaskStatePtr &task);
+
 		/// How masy Tasks are there?  Good for catching leaks
 		static size_t instance_count();
 	};
@@ -219,16 +234,21 @@ namespace crucible {
 	static auto s_tms = make_shared<TaskMasterState>();

 	void
-	TaskState::rescue_queue(TaskQueue &queue)
+	TaskState::rescue_queue(TaskQueue &queue, const bool sort_queue)
 	{
 		if (queue.empty()) {
 			return;
 		}
-		const auto tlcc = tl_current_consumer;
+		const auto &tlcc = tl_current_consumer;
 		if (tlcc) {
 			// We are executing under a TaskConsumer, splice our post-exec queue at front.
 			// No locks needed because we are using only thread-local objects.
 			tlcc->m_local_queue.splice(tlcc->m_local_queue.begin(), queue);
+			if (sort_queue) {
+				tlcc->m_local_queue.sort([&](const TaskStatePtr &a, const TaskStatePtr &b) {
+					return a->m_id < b->m_id;
+				});
+			}
 		} else {
 			// We are not executing under a TaskConsumer.
 			// If there is only one task, then just insert it at the front of the queue.
@@ -239,6 +259,8 @@ namespace crucible {
 				// then push it to the front of the global queue using normal locking methods.
 				TaskStatePtr rescue_task(make_shared<TaskState>("rescue_task", [](){}));
 				swap(rescue_task->m_post_exec_queue, queue);
+				// Do the sort--once--when a new Consumer has picked up the Task
+				rescue_task->m_sort_queue = sort_queue;
 				TaskQueue tq_one { rescue_task };
 				TaskMasterState::push_front(tq_one);
 			}
@@ -251,7 +273,8 @@ namespace crucible {
 		--s_instance_count;
 		unique_lock<mutex> lock(m_mutex);
 		// If any dependent Tasks were appended since the last exec, run them now
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
+		// No need to clear m_sort_queue here, it won't exist soon
 	}

 	TaskState::TaskState(string title, function<void()> exec_fn) :
@@ -310,6 +333,24 @@ namespace crucible {
 			task->m_run_now = true;
 			append_nolock(task);
 		}
+		task->m_idle = false;
+	}
+
+	void
+	TaskState::insert(const TaskStatePtr &task)
+	{
+		THROW_CHECK0(invalid_argument, task);
+		THROW_CHECK2(invalid_argument, m_id, task->m_id, m_id != task->m_id);
+		PairLock lock(m_mutex, task->m_mutex);
+		if (!task->m_run_now) {
+			task->m_run_now = true;
+			// Move the task and its post-exec queue to follow this task,
+			// and request a sort of the flattened list.
+			m_sort_queue = true;
+			m_post_exec_queue.push_back(task);
+			m_post_exec_queue.splice(m_post_exec_queue.end(), task->m_post_exec_queue);
+		}
+		task->m_idle = false;
 	}

 	void
@@ -320,7 +361,7 @@ namespace crucible {

 		unique_lock<mutex> lock(m_mutex);
 		if (m_is_running) {
-			append_nolock(shared_from_this());
+			m_run_again = true;
 			return;
 		} else {
 			m_run_now = false;
@@ -344,8 +385,20 @@ namespace crucible {
 		swap(this_task, tl_current_task);
 		m_is_running = false;

+		if (m_run_again) {
+			m_run_again = false;
+			if (m_idle) {
+				// All the way back to the end of the line
+				TaskMasterState::push_back_idle(shared_from_this());
+			} else {
+				// Insert after any dependents waiting for this Task
+				m_post_exec_queue.push_back(shared_from_this());
+			}
+		}
+
 		// Splice task post_exec queue at front of local queue
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
+		m_sort_queue = false;
 	}

 	string
@@ -365,22 +418,32 @@ namespace crucible {
 	TaskState::run()
 	{
 		unique_lock<mutex> lock(m_mutex);
+		m_idle = false;
 		if (m_run_now) {
 			return;
 		}
 		m_run_now = true;
-		TaskMasterState::push_back(shared_from_this());
+		if (m_is_running) {
+			m_run_again = true;
+		} else {
+			TaskMasterState::push_back(shared_from_this());
+		}
 	}

 	void
 	TaskState::idle()
 	{
 		unique_lock<mutex> lock(m_mutex);
+		m_idle = true;
 		if (m_run_now) {
 			return;
 		}
 		m_run_now = true;
-		TaskMasterState::push_back_idle(shared_from_this());
+		if (m_is_running) {
+			m_run_again = true;
+		} else {
+			TaskMasterState::push_back_idle(shared_from_this());
+		}
 	}

 	TaskMasterState::TaskMasterState(size_t thread_max) :
@@ -740,6 +803,14 @@ namespace crucible {
 		m_task_state->append(that.m_task_state);
 	}

+	void
+	Task::insert(const Task &that) const
+	{
+		THROW_CHECK0(runtime_error, m_task_state);
+		THROW_CHECK0(runtime_error, that);
+		m_task_state->insert(that.m_task_state);
+	}
+
 	Task
 	Task::current_task()
 	{
@@ -854,11 +925,13 @@ namespace crucible {
 		swap(this_consumer, tl_current_consumer);
 		assert(!tl_current_consumer);

-		// Release lock to rescue queue (may attempt to queue a new task at TaskMaster).
-		// rescue_queue normally sends tasks to the local queue of the current TaskConsumer thread,
-		// but we just disconnected ourselves from that.
+		// Release lock to rescue queue (may attempt to queue a
+		// new task at TaskMaster).  rescue_queue normally sends
+		// tasks to the local queue of the current TaskConsumer
+		// thread, but we just disconnected ourselves from that.
+		// No sorting here because this is not a TaskState.
 		lock.unlock();
-		TaskState::rescue_queue(m_local_queue);
+		TaskState::rescue_queue(m_local_queue, false);

 		// Hold lock so we can erase ourselves
 		lock.lock();
@@ -936,21 +1009,6 @@ namespace crucible {
 		m_owner.reset();
 	}

-	void
-	Exclusion::insert_task(const Task &task)
-	{
-		unique_lock<mutex> lock(m_mutex);
-		const auto sp = m_owner.lock();
-		lock.unlock();
-		if (sp) {
-			// If Exclusion is locked then queue task for release;
-			sp->append(task);
-		} else {
-			// otherwise, run the inserted task immediately
-			task.run();
-		}
-	}
-
 	ExclusionLock
 	Exclusion::try_lock(const Task &task)
 	{
@@ -958,7 +1016,7 @@ namespace crucible {
 		const auto sp = m_owner.lock();
 		if (sp) {
 			if (task) {
-				sp->append(task);
+				sp->insert(task);
 			}
 			return ExclusionLock();
 		} else {
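Review note on `TaskState::rescue_queue` above: the splice-then-sort step can be tried in isolation with plain standard containers. This is a minimal sketch, not the crucible implementation: `Job`, `JobQueue`, `rescue` and `ids` are hypothetical names, and locking and the `TaskConsumer` branch are omitted. It shows that the sort covers the whole local queue after the splice, not only the rescued entries.

```cpp
#include <list>
#include <memory>
#include <vector>

// Hypothetical stand-in for TaskState: only the sequential id matters here.
struct Job {
	unsigned id;
};
using JobPtr = std::shared_ptr<Job>;
using JobQueue = std::list<JobPtr>;

// Move every entry of `rescued` to the front of `local`.
// When sort_queue is set, reorder the combined queue by ascending id.
void
rescue(JobQueue &local, JobQueue &rescued, const bool sort_queue)
{
	if (rescued.empty()) {
		return;
	}
	// splice() relinks list nodes; no JobPtr is copied and `rescued` ends up empty
	local.splice(local.begin(), rescued);
	if (sort_queue) {
		// std::list::sort is stable, so equal ids would keep their relative order
		local.sort([](const JobPtr &a, const JobPtr &b) {
			return a->id < b->id;
		});
	}
}

// Helper for inspecting the queue order.
std::vector<unsigned>
ids(const JobQueue &queue)
{
	std::vector<unsigned> rv;
	for (const auto &job : queue) {
		rv.push_back(job->id);
	}
	return rv;
}
```

Starting from a local queue holding ids 5, 2 and a rescued queue holding 9, 1, an unsorted rescue yields 9, 1, 5, 2, while a sorted rescue yields 1, 2, 5, 9.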
--- a/scripts/beesd.in
+++ b/scripts/beesd.in
@@ -1,5 +1,13 @@
 #!/bin/bash

+# If not called from systemd, try to replicate mount unsharing on Ctrl+C
+# see: https://github.com/Zygo/bees/issues/281
+if [ -z "${SYSTEMD_EXEC_PID}" -a -z "${UNSHARE_DONE}" ]; then
+        UNSHARE_DONE=true
+        export UNSHARE_DONE
+        exec unshare -m --propagation private -- "$0" "$@"
+fi
+
 ## Helpful functions
 INFO(){ echo "INFO:" "$@"; }
 ERRO(){ echo "ERROR:" "$@"; exit 1; }
@@ -108,13 +116,11 @@ mkdir -p "$WORK_DIR" || exit 1
 INFO "MOUNT DIR: $MNT_DIR"
 mkdir -p "$MNT_DIR" || exit 1

-mount --make-private -osubvolid=5 /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
+mount --make-private -osubvolid=5,nodev,noexec /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1

 if [ ! -d "$BEESHOME" ]; then
    INFO "Create subvol $BEESHOME for store bees data"
    btrfs sub cre "$BEESHOME"
-else
-    btrfs sub show "$BEESHOME" &> /dev/null || ERRO "$BEESHOME MUST BE A SUBVOL!"
 fi

 # Check DB size
--- a/scripts/beesd@.service.in
+++ b/scripts/beesd@.service.in
@@ -17,6 +17,7 @@ KillSignal=SIGTERM
 MemoryAccounting=true
 Nice=19
 Restart=on-abnormal
+RuntimeDirectoryMode=0700
 RuntimeDirectory=bees
 StartupCPUWeight=25
 StartupIOWeight=25
--- a/src/bees-context.cc
+++ b/src/bees-context.cc
@@ -230,8 +230,10 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 	BeesAddress first_addr(brp.first.fd(), brp.first.begin());
 	BeesAddress second_addr(brp.second.fd(), brp.second.begin());

-	if (first_addr.get_physical_or_zero() == second_addr.get_physical_or_zero()) {
-		BEESLOGTRACE("equal physical addresses in dedup");
+	const auto first_gpoz = first_addr.get_physical_or_zero();
+	const auto second_gpoz = second_addr.get_physical_or_zero();
+	if (first_gpoz == second_gpoz) {
+		BEESLOGDEBUG("equal physical addresses " << first_addr << " and " << second_addr << " in dedup");
 		BEESCOUNT(bug_dedup_same_physical);
 	}

@@ -259,7 +261,7 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 				BEESCOUNTADD(dedup_bytes, brp.first.size());
 			} else {
 				BEESCOUNT(dedup_miss);
-				BEESLOGWARN("NO Dedup! " << brp);
+				BEESLOGINFO("NO Dedup! " << brp);
 			}

 			lock.reset();
@@ -373,7 +375,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 		Extent::OBSCURED | Extent::PREALLOC
 	)) {
 		BEESCOUNT(scan_interesting);
-		BEESLOGWARN("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
+		BEESLOGINFO("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
 	}

 	if (e.flags() & Extent::HOLE) {
@@ -385,7 +387,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 	if (e.flags() & Extent::PREALLOC) {
 		// Prealloc is all zero and we replace it with a hole.
 		// No special handling is required here.  Nuke it and move on.
-		BEESLOGINFO("prealloc extent " << e);
+		BEESLOGINFO("prealloc extent " << e << " in " << bfr);
 		// Must not extend past EOF
 		auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
 		// Must hold tmpfile until dedupe is done
@@ -534,7 +536,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)

 			// Hash is toxic
 			if (found_addr.is_toxic()) {
-				BEESLOGWARN("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
+				BEESLOGDEBUG("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
 				// Don't push these back in because we'll never delete them.
 				// Extents may become non-toxic so give them a chance to expire.
 				// hash_table->push_front_hash_addr(hash, found_addr);
@@ -556,7 +558,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 			BeesResolver resolved(m_ctx, found_addr);
 			// Toxic extents are really toxic
 			if (resolved.is_toxic()) {
-				BEESLOGWARN("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
+				BEESLOGDEBUG("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
 				BEESCOUNT(scan_toxic_match);
 				// Make sure we never see this hash again.
 				// It has become toxic since it was inserted into the hash table.
@@ -917,7 +919,7 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)

 	// Sanity check
 	if (bfr.begin() >= bfr.file_size()) {
-		BEESLOGWARN("past EOF: " << bfr);
+		BEESLOGDEBUG("past EOF: " << bfr);
 		BEESCOUNT(scanf_eof);
 		return false;
 	}
--- a/src/bees-hash.cc
+++ b/src/bees-hash.cc
@@ -797,7 +797,7 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
 	for (auto fp = madv_flags; fp->value; ++fp) {
 		BEESTOOLONG("madvise(" << fp->name << ")");
 		if (madvise(m_byte_ptr, m_size, fp->value)) {
-			BEESLOGWARN("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
+			BEESLOGNOTICE("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
 		}
 	}

@@ -811,8 +811,19 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
 		prefetch_loop();
        });

-	// Blacklist might fail if the hash table is not stored on a btrfs
+	// Blacklist might fail if the hash table is not stored on a btrfs,
+	// or if it's on a _different_ btrfs
 	catch_all([&]() {
+		// Root is definitely a btrfs
+		BtrfsIoctlFsInfoArgs root_info;
+		root_info.do_ioctl(m_ctx->root_fd());
+		// Hash might not be a btrfs
+		BtrfsIoctlFsInfoArgs hash_info;
+		// If btrfs fs_info ioctl fails, it must be a different fs
+		if (!hash_info.do_ioctl_nothrow(m_fd)) return;
+		// If Hash is a btrfs, Root must be the same one
+		if (root_info.fsid() != hash_info.fsid()) return;
+		// Hash is on the same one, blacklist it
 		m_ctx->blacklist_insert(BeesFileId(m_fd));
 	});
 }
--- a/src/bees-roots.cc
+++ b/src/bees-roots.cc
--- a/src/bees-trace.cc
+++ b/src/bees-trace.cc
@@ -8,38 +8,32 @@ thread_local BeesTracer *BeesTracer::tl_next_tracer = nullptr;
 thread_local bool BeesTracer::tl_first = true;
 thread_local bool BeesTracer::tl_silent = false;

+bool
+exception_check()
+{
 #if __cplusplus >= 201703
-static
-bool
-exception_check()
-{
 	return uncaught_exceptions();
-}
 #else
-static
-bool
-exception_check()
-{
 	return uncaught_exception();
-}
 #endif
+}

 BeesTracer::~BeesTracer()
 {
 	if (!tl_silent && exception_check()) {
 		if (tl_first) {
-			BEESLOGNOTICE("--- BEGIN TRACE --- exception ---");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE --- exception ---");
 			tl_first = false;
 		}
 		try {
 			m_func();
 		} catch (exception &e) {
-			BEESLOGNOTICE("Nested exception: " << e.what());
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception: " << e.what());
 		} catch (...) {
-			BEESLOGNOTICE("Nested exception ...");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception ...");
 		}
 		if (!m_next_tracer) {
-			BEESLOGNOTICE("---  END  TRACE --- exception ---");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: ---  END  TRACE --- exception ---");
 		}
 	}
 	tl_next_tracer = m_next_tracer;
@@ -49,7 +43,7 @@ BeesTracer::~BeesTracer()
 	}
 }

-BeesTracer::BeesTracer(function<void()> f, bool silent) :
+BeesTracer::BeesTracer(const function<void()> &f, bool silent) :
 	m_func(f)
 {
 	m_next_tracer = tl_next_tracer;
@@ -61,12 +55,12 @@ void
 BeesTracer::trace_now()
 {
 	BeesTracer *tp = tl_next_tracer;
-	BEESLOGNOTICE("--- BEGIN TRACE ---");
+	BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE ---");
 	while (tp) {
 		tp->m_func();
 		tp = tp->m_next_tracer;
 	}
-	BEESLOGNOTICE("---  END  TRACE ---");
+	BEESLOG(BEES_TRACE_LEVEL, "TRACE: ---  END  TRACE ---");
 }

 bool
@@ -91,9 +85,9 @@ BeesNote::~BeesNote()
 	tl_next = m_prev;
 	unique_lock<mutex> lock(s_mutex);
 	if (tl_next) {
-		s_status[crucible::gettid()] = tl_next;
+		s_status[gettid()] = tl_next;
 	} else {
-		s_status.erase(crucible::gettid());
+		s_status.erase(gettid());
 	}
 }

@@ -104,7 +98,7 @@ BeesNote::BeesNote(function<void(ostream &os)> f) :
 	m_prev = tl_next;
 	tl_next = this;
 	unique_lock<mutex> lock(s_mutex);
-	s_status[crucible::gettid()] = tl_next;
+	s_status[gettid()] = tl_next;
 }

 void
--- a/src/bees-types.cc
+++ b/src/bees-types.cc
@@ -457,7 +457,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
 			BEESCOUNT(pairbackward_toxic_hash);
 			break;
 		}
@@ -558,7 +558,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
 			BEESCOUNT(pairforward_toxic_hash);
 			break;
 		}
@@ -572,7 +572,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 	}

 	if (first.overlaps(second)) {
-		BEESLOGTRACE("after grow, first " << first << "\n\toverlaps " << second);
+		BEESLOGDEBUG("after grow, first " << first << "\n\toverlaps " << second);
 		BEESCOUNT(bug_grow_pair_overlaps);
 	}

@@ -674,7 +674,7 @@ BeesAddress::magic_check(uint64_t flags)
 	static const unsigned recognized_flags = compressed_flags | delalloc_flags | ignore_flags | unusable_flags;

 	if (flags & ~recognized_flags) {
-		BEESLOGTRACE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
+		BEESLOGNOTICE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
 		m_addr = UNUSABLE;
 		// maybe we throw here?
 		BEESCOUNT(addr_unrecognized);
--- a/src/bees.cc
+++ b/src/bees.cc
@@ -4,6 +4,7 @@
 #include "crucible/process.h"
 #include "crucible/string.h"
 #include "crucible/task.h"
+#include "crucible/uname.h"

 #include <cctype>
 #include <cmath>
@@ -11,17 +12,19 @@

 #include <iostream>
 #include <memory>
+#include <regex>
 #include <sstream>

 // PRIx64
 #include <inttypes.h>

-#include <sched.h>
-#include <sys/fanotify.h>
-
 #include <linux/fs.h>
 #include <sys/ioctl.h>

+// statfs
+#include <linux/magic.h>
+#include <sys/statfs.h>
+
 // setrlimit
 #include <sys/time.h>
 #include <sys/resource.h>
@@ -198,7 +201,7 @@ BeesTooLong::check() const
 	if (age() > m_limit) {
 		ostringstream oss;
 		m_func(oss);
-		BEESLOGWARN("PERFORMANCE: " << *this << " sec: " << oss.str());
+		BEESLOGINFO("PERFORMANCE: " << *this << " sec: " << oss.str());
 	}
 }

@@ -246,10 +249,6 @@ bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
 	Timer readahead_timer;
 	BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 	BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
-#if 0
-	// In the kernel, readahead() is identical to posix_fadvise(..., POSIX_FADV_DONTNEED)
-	DIE_IF_NON_ZERO(readahead(fd, offset, size));
-#else
 	// Make sure this data is in page cache by brute force
 	// The btrfs kernel code does readahead with lower ioprio
 	// and might discard the readahead request entirely.
@@ -263,13 +262,16 @@ bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
 		// Ignore errors and short reads.  It turns out our size
 		// parameter isn't all that accurate, so we can't use
 		// the pread_or_die template.
-		(void)!pread(fd, dummy, this_read_size, working_offset);
-		BEESCOUNT(readahead_count);
-		BEESCOUNTADD(readahead_bytes, this_read_size);
+		const auto pr_rv = pread(fd, dummy, this_read_size, working_offset);
+		if (pr_rv >= 0) {
+			BEESCOUNT(readahead_count);
+			BEESCOUNTADD(readahead_bytes, pr_rv);
+		} else {
+			BEESCOUNT(readahead_fail);
+		}
 		working_offset += this_read_size;
 		working_size -= this_read_size;
 	}
-#endif
 	BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
 }

@@ -392,6 +394,73 @@ BeesStringFile::read()
 	return read_string(fd, st.st_size);
 }

+static
+void
+bees_fsync(int const fd)
+{
+
+	// Note that when btrfs renames a temporary over an existing file,
+	// it flushes the temporary, so we get the right behavior if we
+	// just do nothing here (except when the file is first created;
+	// however, in that case the result is the same as if the file
+	// did not exist, was empty, or was filled with garbage).
+	//
+	// Kernel versions prior to 5.16 had bugs which would put ghost
+	// dirents in $BEESHOME if there was a crash when we called
+	// fsync() here.
+	//
+	// Some other filesystems will throw our data away if we don't
+	// call fsync, so we do need to call fsync() on those filesystems.
+	//
+	// Newer btrfs kernel versions rely on fsync() to report
+	// unrecoverable write errors.  If we don't check the fsync()
+	// result, we'll lose the data when we rename().  Kernel 6.2 added
+	// a number of new root causes for the class of "unrecoverable
+	// write errors" so we need to check this now.
+
+	BEESNOTE("checking filesystem type for " << name_fd(fd));
+	// LSB deprecated statfs without providing a replacement that
+	// can fill in the f_type field.
+	struct statfs stf = { 0 };
+	DIE_IF_NON_ZERO(fstatfs(fd, &stf));
+	if (static_cast<decltype(BTRFS_SUPER_MAGIC)>(stf.f_type) != BTRFS_SUPER_MAGIC) {
+		BEESLOGONCE("Using fsync on non-btrfs filesystem type " << to_hex(stf.f_type));
+		BEESNOTE("fsync non-btrfs " << name_fd(fd));
+		DIE_IF_NON_ZERO(fsync(fd));
+		return;
+	}
+
+	static bool did_uname = false;
+	static bool do_fsync = false;
+
+	if (!did_uname) {
+		Uname uname;
+		const string version(uname.release);
+		static const regex version_re(R"/(^(\d+)\.(\d+)\.)/", regex::optimize | regex::ECMAScript);
+		smatch m;
+		// Last known bug in the fsync-rename use case was fixed in kernel 5.16
+		static const auto min_major = 5, min_minor = 16;
+		if (regex_search(version, m, version_re)) {
+			const auto major = stoul(m[1]);
+			const auto minor = stoul(m[2]);
+			if (tie(major, minor) >= tie(min_major, min_minor)) {
+				BEESLOGONCE("Using fsync on btrfs because kernel version is " << major << "." << minor);
+				do_fsync = true;
+			} else {
+				BEESLOGONCE("Not using fsync on btrfs because kernel version is " << major << "." << minor);
+			}
+		} else {
+			BEESLOGONCE("Not using fsync on btrfs because can't parse kernel version '" << version << "'");
+		}
+		did_uname = true;
+	}
+
+	if (do_fsync) {
+		BEESNOTE("fsync btrfs " << name_fd(fd));
+		DIE_IF_NON_ZERO(fsync(fd));
+	}
+}
+
 void
 BeesStringFile::write(string contents)
 {
@@ -407,19 +476,8 @@ BeesStringFile::write(string contents)
 		Fd ofd = openat_or_die(m_dir_fd, tmpname, FLAGS_CREATE_FILE, S_IRUSR | S_IWUSR);
 		BEESNOTE("writing " << tmpname << " in " << name_fd(m_dir_fd));
 		write_or_die(ofd, contents);
-#if 0
-		// This triggers too many btrfs bugs.  I wish I was kidding.
-		// Forget snapshots, balance, compression, and dedupe:
-		// the system call you have to fear on btrfs is fsync().
-		// Also note that when bees renames a temporary over an
-		// existing file, it flushes the temporary, so we get
-		// the right behavior if we just do nothing here
-		// (except when the file is first created; however,
-		// in that case the result is the same as if the file
-		// did not exist, was empty, or was filled with garbage).
 		BEESNOTE("fsyncing " << tmpname << " in " << name_fd(m_dir_fd));
-		DIE_IF_NON_ZERO(fsync(ofd));
-#endif
+		bees_fsync(ofd);
 	}
 	BEESNOTE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
 	BEESTRACE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
@@ -444,6 +502,23 @@ BeesTempFile::resize(off_t offset)
 	// Count time spent here
 	BEESCOUNTADD(tmp_resize_ms, resize_timer.age() * 1000);

+	// Modify flags - every time
+	// - btrfs will keep trying to set FS_NOCOMP_FL behind us when compression heuristics identify
+	//   the data as compressible, but it fails to compress
+	// - clear FS_NOCOW_FL because we can only dedupe between files with the same FS_NOCOW_FL state,
+	//   and we don't open FS_NOCOW_FL files for dedupe.
+	BEESTRACE("Getting FS_COMPR_FL and FS_NOCOMP_FL on m_fd " << name_fd(m_fd));
+	int flags = ioctl_iflags_get(m_fd);
+	const auto orig_flags = flags;
+
+	flags |= FS_COMPR_FL;
+	flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
+	if (flags != orig_flags) {
+		BEESTRACE("Setting FS_COMPR_FL and clearing FS_NOCOMP_FL | FS_NOCOW_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
+		ioctl_iflags_set(m_fd, flags);
+	}
+
+	// That may have queued some delayed ref deletes, so throttle them
 	bees_throttle(resize_timer.age(), "tmpfile_resize");
 }

@@ -485,13 +560,6 @@ BeesTempFile::BeesTempFile(shared_ptr<BeesContext> ctx) :
 	// Add this file to open_root_ino lookup table
 	m_roots->insert_tmpfile(m_fd);

-	// Set compression attribute
-	BEESTRACE("Getting FS_COMPR_FL on m_fd " << name_fd(m_fd));
-	int flags = ioctl_iflags_get(m_fd);
-	flags |= FS_COMPR_FL;
-	BEESTRACE("Setting FS_COMPR_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
-	ioctl_iflags_set(m_fd, flags);
-
 	// Count time spent here
 	BEESCOUNTADD(tmp_create_ms, create_timer.age() * 1000);

@@ -683,7 +751,7 @@ bees_main(int argc, char *argv[])
 			BEESLOGDEBUG("exception (ignored): " << s);
 			BEESCOUNT(exception_caught_silent);
 		} else {
-			BEESLOGNOTICE("\n\n*** EXCEPTION ***\n\t" << s << "\n***\n");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: EXCEPTION: " << s);
 			BEESCOUNT(exception_caught);
 		}
 	});
@@ -704,9 +772,8 @@ bees_main(int argc, char *argv[])
 	shared_ptr<BeesContext> bc = make_shared<BeesContext>();
 	BEESLOGDEBUG("context constructed");

-	string cwd(readlink_or_die("/proc/self/cwd"));
-
 	// Defaults
+	bool use_relative_paths = false;
 	bool chatter_prefix_timestamp = true;
 	double thread_factor = 0;
 	unsigned thread_count = 0;
@@ -778,7 +845,7 @@ bees_main(int argc, char *argv[])
 				thread_min = stoul(optarg);
 				break;
 			case 'P':
-				crucible::set_relative_path(cwd);
+				use_relative_paths = true;
 				break;
 			case 'T':
 				chatter_prefix_timestamp = false;
@@ -796,7 +863,7 @@ bees_main(int argc, char *argv[])
 				root_scan_mode = static_cast<BeesRoots::ScanMode>(stoul(optarg));
 				break;
 			case 'p':
-				crucible::set_relative_path("");
+				use_relative_paths = false;
 				break;
 			case 't':
 				chatter_prefix_timestamp = true;
@@ -866,18 +933,19 @@ bees_main(int argc, char *argv[])
 	BEESLOGNOTICE("setting root path to '" << root_path << "'");
 	bc->set_root_path(root_path);

+	// Set path prefix
+	if (use_relative_paths) {
+		crucible::set_relative_path(name_fd(bc->root_fd()));
+	}
+
 	// Workaround for btrfs send
 	bc->roots()->set_workaround_btrfs_send(workaround_btrfs_send);

 	// Set root scan mode
 	bc->roots()->set_scan_mode(root_scan_mode);

-	if (root_scan_mode == BeesRoots::SCAN_MODE_EXTENT) {
-		MultiLocker::enable_locking(false);
-	} else {
-		// Workaround for a kernel bug that the subvol-based crawlers keep triggering
-		MultiLocker::enable_locking(true);
-	}
+	// Workaround for the logical-ino-vs-clone kernel bug
+	MultiLocker::enable_locking(true);

 	// Start crawlers
 	bc->start();
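
Note on the `BeesTempFile::resize` hunk above: the flags are now re-read on every resize and written back only when they differ from the desired state. The following is a minimal standalone sketch of that read, modify, compare, write pattern. It uses the raw `FS_IOC_GETFLAGS` / `FS_IOC_SETFLAGS` ioctls in place of the `ioctl_iflags_get` / `ioctl_iflags_set` wrappers, and the names `wanted_tmpfile_flags` and `update_tmpfile_flags` are hypothetical, not part of bees.

```cpp
#include <linux/fs.h>   // FS_COMPR_FL, FS_NOCOMP_FL, FS_NOCOW_FL, FS_IOC_GETFLAGS, FS_IOC_SETFLAGS
#include <sys/ioctl.h>  // ioctl()

// Hypothetical helper: compute the inode flags the temp file should carry.
// Sets FS_COMPR_FL and clears FS_NOCOMP_FL and FS_NOCOW_FL, as in the hunk.
int wanted_tmpfile_flags(int flags)
{
	flags |= FS_COMPR_FL;
	flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
	return flags;
}

// Hypothetical helper: returns 1 if the flags were written, 0 if they were
// already correct (no inode update needed), and -1 if either ioctl failed.
int update_tmpfile_flags(int fd)
{
	int flags = 0;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		return -1;
	}
	const int wanted = wanted_tmpfile_flags(flags);
	if (wanted == flags) {
		return 0;
	}
	if (ioctl(fd, FS_IOC_SETFLAGS, &wanted) < 0) {
		return -1;
	}
	return 1;
}
```

Keeping the flag computation in a pure function makes the "nothing to do" case easy to check without touching a filesystem; only `update_tmpfile_flags` needs a real file descriptor.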
--- a/src/bees.h
+++ b/src/bees.h
@@ -122,9 +122,9 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 // macros ----------------------------------------

 #define BEESLOG(lv,x)   do { if (lv < bees_log_level) { Chatter __chatter(lv, BeesNote::get_name()); __chatter << x; } } while (0)
-#define BEESLOGTRACE(x) do { BEESLOG(LOG_DEBUG, x); BeesTracer::trace_now(); } while (0)

-#define BEESTRACE(x)   BeesTracer  SRSLY_WTF_C(beesTracer_,  __LINE__) ([&]()                 { BEESLOG(LOG_ERR, x << " at " << __FILE__ << ":" << __LINE__);   })
+#define BEES_TRACE_LEVEL LOG_DEBUG
+#define BEESTRACE(x)   BeesTracer  SRSLY_WTF_C(beesTracer_,  __LINE__) ([&]()                 { BEESLOG(BEES_TRACE_LEVEL, "TRACE: " << x << " at " << __FILE__ << ":" << __LINE__);   })
 #define BEESTOOLONG(x) BeesTooLong SRSLY_WTF_C(beesTooLong_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
 #define BEESNOTE(x)    BeesNote    SRSLY_WTF_C(beesNote_,    __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })

@@ -134,6 +134,14 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 #define BEESLOGINFO(x)   BEESLOG(LOG_INFO, x)
 #define BEESLOGDEBUG(x)  BEESLOG(LOG_DEBUG, x)

+#define BEESLOGONCE(__x) do { \
+        static bool already_logged = false; \
+        if (!already_logged) { \
+                already_logged = true; \
+                BEESLOGNOTICE(__x); \
+        } \
+} while (false)
+
 #define BEESCOUNT(stat) do { \
 	BeesStats::s_global.add_count(#stat); \
 } while (0)
@@ -185,7 +193,7 @@ class BeesTracer {
 	thread_local static bool tl_silent;
 	thread_local static bool tl_first;
 public:
-	BeesTracer(function<void()> f, bool silent = false);
+	BeesTracer(const function<void()> &f, bool silent = false);
 	~BeesTracer();
 	static void trace_now();
 	static bool get_silent();
@@ -521,7 +529,7 @@ class BeesCrawl {

 	bool fetch_extents();
 	void fetch_extents_harder();
-	bool restart_crawl();
+	bool restart_crawl_unlocked();
 	BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;

 public:
@@ -535,6 +543,7 @@ public:
 	void deferred(bool def_setting);
 	bool deferred() const;
 	bool finished() const;
+	bool restart_crawl();
 };

 class BeesScanMode;
@@ -543,7 +552,8 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	shared_ptr<BeesContext>			m_ctx;

 	BeesStringFile				m_crawl_state_file;
-	map<uint64_t, shared_ptr<BeesCrawl>>	m_root_crawl_map;
+	using CrawlMap = map<uint64_t, shared_ptr<BeesCrawl>>;
+	CrawlMap				m_root_crawl_map;
 	mutex					m_mutex;
 	uint64_t				m_crawl_dirty = 0;
 	uint64_t				m_crawl_clean = 0;
@@ -562,7 +572,7 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	condition_variable			m_stop_condvar;
 	bool					m_stop_requested = false;

-	void insert_new_crawl();
+	CrawlMap insert_new_crawl();
 	Fd open_root_nocache(uint64_t root);
 	Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
 	uint64_t transid_max_nocache();
@@ -578,13 +588,14 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	void current_state_set(const BeesCrawlState &bcs);
 	bool crawl_batch(shared_ptr<BeesCrawl> crawl);
 	void clear_caches();
-
-friend class BeesScanModeExtent;
 	shared_ptr<BeesCrawl> insert_root(const BeesCrawlState &bcs);
+	bool up_to_date(const BeesCrawlState &bcs);

 friend class BeesCrawl;
 friend class BeesFdCache;
 friend class BeesScanMode;
+friend class BeesScanModeSubvol;
+friend class BeesScanModeExtent;

 public:
 	BeesRoots(shared_ptr<BeesContext> ctx);
@@ -890,5 +901,6 @@ void bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offse
 void bees_unreadahead(int fd, off_t offset, size_t size);
 void bees_throttle(double time_used, const char *context);
 string format_time(time_t t);
+bool exception_check();

 #endif
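
Note on the `BEESLOGONCE` macro added above: because the `static bool` lives inside the macro body, each expansion site gets its own flag, so every call site logs at most once per process. The sketch below shows that behaviour in isolation. `DEMO_LOG`, `DEMO_LOG_ONCE`, `g_log` and `poll_once` are hypothetical stand-ins (the real macro forwards to `BEESLOGNOTICE`), and like the original the flag is a plain `bool` with no synchronization between threads.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the bees logger: collects messages in a stream.
std::ostringstream g_log;
#define DEMO_LOG(x) do { g_log << x << '\n'; } while (0)

// Same shape as BEESLOGONCE: one static flag per expansion site.
#define DEMO_LOG_ONCE(x) do { \
	static bool already_logged = false; \
	if (!already_logged) { \
		already_logged = true; \
		DEMO_LOG(x); \
	} \
} while (false)

// Only the first call through this call site emits a message.
void poll_once(int i)
{
	DEMO_LOG_ONCE("first poll saw i=" << i);
}
```

Calling `poll_once` repeatedly leaves a single line in `g_log`; a second `DEMO_LOG_ONCE` placed elsewhere would log independently of this one.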
--- a/test/seeker.cc
+++ b/test/seeker.cc
@@ -19,7 +19,9 @@ seeker_finder(const vector<uint64_t> &vec, uint64_t lower, uint64_t upper)
 	if (ub != s.end()) ++ub;
 	if (ub != s.end()) ++ub;
 	for (; ub != s.end(); ++ub) {
-		if (*ub > upper) break;
+		if (*ub > upper) {
+			break;
+		}
 	}
 	return set<uint64_t>(lb, ub);
 }
@@ -28,7 +30,7 @@ static bool test_fails = false;

 static
 void
-seeker_test(const vector<uint64_t> &vec, uint64_t const target)
+seeker_test(const vector<uint64_t> &vec, uint64_t const target, bool const always_out = false)
 {
 	cerr << "Find " << target << " in {";
 	for (auto i : vec) {
@@ -36,11 +38,13 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 	}
 	cerr << " } = ";
 	size_t loops = 0;
+	tl_seeker_debug_str = make_shared<ostringstream>();
+	bool local_test_fails = false;
 	bool excepted = catch_all([&]() {
-		auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
+		const auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
 			++loops;
 			return seeker_finder(vec, lower, upper);
-		});
+		}, uint64_t(32));
 		cerr << found;
 		uint64_t my_found = 0;
 		for (auto i : vec) {
@@ -52,13 +56,15 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 			cerr << " (correct)";
 		} else {
 			cerr << " (INCORRECT - right answer is " << my_found << ")";
-			test_fails = true;
+			local_test_fails = true;
 		}
 	});
 	cerr << " (" << loops << " loops)" << endl;
-	if (excepted) {
-		test_fails = true;
+	if (excepted || local_test_fails || always_out) {
+		cerr << dynamic_pointer_cast<ostringstream>(tl_seeker_debug_str)->str();
 	}
+	test_fails = test_fails || local_test_fails;
+	tl_seeker_debug_str.reset();
 }

 static
@@ -89,6 +95,39 @@ test_seeker()
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max());
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max() - 1);
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() - 1 }, numeric_limits<uint64_t>::max());
+
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 0);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 1);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 2);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 3);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 4);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 5);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 6);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 7);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 8);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 9);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 1 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 2 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 3 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 4 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 5 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 6 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 7 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 8 );
+
+	// Pulled from a bees debug log
+	seeker_test(vector<uint64_t> {
+		6821962845,
+		6821962848,
+		6821963411,
+		6821963422,
+		6821963536,
+		6821963539,
+		6821963835, // <- appeared during the search, causing an exception
+		6821963841,
+		6822575316,
+	}, 6821971036, true);
 }