docs: add vmalloc bug to kernel bugs list

The bug is:

	v6.3-rc6: f349b15e183d mm: vmalloc: avoid warn_alloc noise caused by fatal signal

The fixes are:

	v6.4: 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
	v6.3.10: c189994b5dd3 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails

The bug has been backported to LTS, but the fix has not:

	v6.2.11: 61334bc29781 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
	v6.1.24: ef6bd8f64ce0 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
	v5.15.107: a184df0de132 mm: vmalloc: avoid warn_alloc noise caused by fatal signal

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
context: log when LOGICAL_INO returns 0 refs
2025-08-02 13:53:28 +02:00 · 2023-07-06 13:50:12 -04:00 · 2023-07-06 12:54:33 -04:00 · 2023-07-06 12:49:36 -04:00 · 2023-07-06 12:49:36 -04:00 · 2023-05-07 21:24:21 -04:00
48 changed files with 2859 additions and 1482 deletions
--- a/Defines.mk
+++ b/Defines.mk
@@ -2,6 +2,7 @@ MAKE += PREFIX=$(PREFIX) LIBEXEC_PREFIX=$(LIBEXEC_PREFIX) ETC_PREFIX=$(ETC_PREFI

 define TEMPLATE_COMPILER =
 sed $< >$@ \
+		-e's#@DESTDIR@#$(DESTDIR)#' \
 		-e's#@PREFIX@#$(PREFIX)#' \
 		-e's#@ETC_PREFIX@#$(ETC_PREFIX)#' \
 		-e's#@LIBEXEC_PREFIX@#$(LIBEXEC_PREFIX)#'
--- a/7
+++ b/7
@@ -49,11 +49,6 @@ scripts/%: scripts/%.in

 scripts: scripts/beesd scripts/beesd@.service

-install_tools: ## Install support tools + libs
-install_tools: src
-	install -Dm755 bin/fiemap $(DESTDIR)$(PREFIX)/bin/fiemap
-	install -Dm755 bin/fiewalk $(DESTDIR)$(PREFIX)/sbin/fiewalk
-
 install_bees: ## Install bees + libs
 install_bees: src $(RUN_INSTALL_TESTS)
 	install -Dm755 bin/bees	$(DESTDIR)$(LIBEXEC_PREFIX)/bees
@@ -67,7 +62,7 @@ ifneq ($(SYSTEMD_SYSTEM_UNIT_DIR),)
 endif

 install: ## Install distribution
-install: install_bees install_scripts $(OPTIONAL_INSTALL_TARGETS)
+install: install_bees install_scripts

 help: ## Show help
 	@fgrep -h "##" $(MAKEFILE_LIST) | fgrep -v fgrep | sed -e 's/\\$$//' | sed -e 's/##/\t/'
--- a/README.md
+++ b/README.md
@@ -17,7 +17,6 @@ Strengths
 * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
 * Daemon incrementally dedupes new data using btrfs tree search
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * **NEW** [Works around `btrfs send` problems with dedupe and incremental parent snapshots](docs/options.md)
 * Works around btrfs filesystem structure to free more disk space
 * Persistent hash table for rapid restart after shutdown
 * Whole-filesystem dedupe - including snapshots
@@ -70,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2022 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
--- a/docs/btrfs-kernel.md
+++ b/docs/btrfs-kernel.md
@@ -7,23 +7,24 @@ First, a warning that is not specific to bees:
 severe regression that can lead to fatal metadata corruption.**
 This issue is fixed in kernel 5.4.14 and later.

-**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, or 5.12,
-with recent LTS and -stable updates.**  The latest released kernel as
-of this writing is 5.18.18.
+**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
+6.0, or 6.1, with recent LTS and -stable updates.**  The latest released
+kernel as of this writing is 6.4.1.

-4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with
-some issues.  Older kernels will be slower (a little slower or a lot
-slower depending on which issues are triggered).  Not all fixes are
-backported.
+4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
+issues.  Older kernels will be slower (a little slower or a lot slower
+depending on which issues are triggered).  Not all fixes are backported.

 Obsolete non-LTS kernels have a variety of unfixed issues and should
 not be used with btrfs.  For details see the table below.

 bees requires btrfs kernel API version 4.2 or higher, and does not work
-on older kernels.
+at all on older kernels.

-bees will detect and use btrfs kernel API up to version 4.15 if present.
-In some future bees release, this API version may become mandatory.
+Some bees features rely on kernel 4.15 to work, and these features will
+not be available on older kernels.  Currently, bees is still usable on
+older kernels with degraded performance or with options disabled, but
+support for older kernels may be removed.



@@ -58,14 +59,17 @@ These bugs are particularly popular among bees users, though not all are specifi
 | - | 5.8 | deadlock in `TREE_SEARCH` ioctl (core component of bees filesystem scanner), followed by regression in deadlock fix | 4.4.237, 4.9.237, 4.14.199, 4.19.146, 5.4.66, 5.8.10 and later | a48b73eca4ce btrfs: fix potential deadlock in the search ioctl, 1c78544eaa46 btrfs: fix wrong address when faulting in pages in the search ioctl
 | 5.7 | 5.10 | kernel crash if balance receives fatal signal e.g. Ctrl-C | 5.4.93, 5.10.11, 5.11 and later | 18d3bff411c8 btrfs: don't get an EINTR during drop_snapshot for reloc
 | 5.10 | 5.10 | 20x write performance regression | 5.10.8, 5.11 and later | e076ab2a2ca7 btrfs: shrink delalloc pages instead of full inodes
-| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
+| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
 | - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
 | - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
 | 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
 | - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
 | - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
 | 5.18 | 5.19 | parent transid verify failed during log tree replay after a crash during a rename operation | 5.18.18, 5.19.2, 6.0 and later | 723df2bcc9e1 btrfs: join running log transaction when logging new name
-| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl | - | workaround: reduce bees thread count to 1 with `-c1`
+| 5.12 | 6.0 | space cache corruption and potential double allocations | 5.15.65, 5.19.6, 6.0 and later | ced8ecf026fd btrfs: fix space cache corruption and potential double allocations
+| 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later.  Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
+| 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
+| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that

 "Last bad kernel" refers to that version's last stable update from
 kernel.org.  Distro kernels may backport additional fixes.  Consult
@@ -80,31 +84,45 @@ through 5.4.13 inclusive.
 A "-" for "first bad kernel" indicates the bug has been present since
 the relevant feature first appeared in btrfs.

-A "-" for "last bad kernel" indicates the bug has not yet been fixed as
-of 5.18.18.
+A "-" for "last bad kernel" indicates the bug has not yet been fixed in
+current kernels (see top of this page for which kernel version that is).

 In cases where issues are fixed by commits spread out over multiple
 kernel versions, "fixed kernel version" refers to the version that
-contains all components of the fix.
+contains the last committed component of the fix.


 Workarounds for known kernel bugs
 ---------------------------------

-* **Hangs with high worker thread counts**:  On kernels newer than
-  5.4, multiple threads running `LOGICAL_INO` and dedupe ioctls
-  at the same time can lead to a kernel hang.  The workaround is
-  to reduce the thread count to 1 with `-c1`.
+* **Hangs with concurrent `LOGICAL_INO` and dedupe**:  on all
+  kernel versions so far, multiple threads running `LOGICAL_INO`
+  and dedupe ioctls at the same time on the same inodes or extents
+  can lead to a kernel hang.  The kernel enters an infinite loop in
+  `add_all_parents`, where `count` is 0, `ref->count` is 1, and
+  `btrfs_next_item` or `btrfs_next_old_item` never finds a matching ref.

-* **Tree mod log issues**:  bees will detect that a btrfs balance is
-  running, and pause bees activity until the balance is done.  This avoids
-  running both the `LOGICAL_INO` ioctl and btrfs balance at the same time,
-  which avoids kernel crashes on old kernel versions.
+  bees has two workarounds for this bug: 1. schedule work so that multiple
+  threads do not simultaneously access the same inode or the same extent,
+  and 2. use a brute-force global lock within bees that prevents any
+  thread from running `LOGICAL_INO` while any other thread is running
+  dedupe.

-  The numbers for "tree mod log issue #" in the above table are arbitrary.
-  There are a lot of them, and they all behave fairly similarly.
+  Workaround #1 isn't really a workaround, since we want to do the same
+  thing for unrelated performance reasons.  If multiple threads try to
+  perform dedupe operations on the same extent or inode, btrfs will make
+  all the threads wait for the same locks anyway, so it's better to have
+  bees find some other inode or extent to work on while waiting for btrfs
+  to finish.

-  This workaround is less necessary for kernels 5.4.19 and later.
+  Workaround #2 doesn't seem to be needed after implementing workaround
+  #1, but it's better to be slightly slower than to hang one CPU core
+  and the filesystem until the kernel is rebooted.
+
+  It is still theoretically possible to trigger the kernel bug when
+  running bees at the same time as other dedupers, or other programs
+  that use `LOGICAL_INO` like `btdu`; however, it's extremely difficult
+  to reproduce the bug without closely cooperating threads.

 * **Slow backrefs** (aka toxic extents):  Under certain conditions,
  if the number of references to a single shared extent grows too
@@ -120,8 +138,8 @@ Workarounds for known kernel bugs
  at this time of writing only bees has a workaround for this bug.

  This workaround is less necessary for kernels 5.4.96, 5.7 and later,
-  though it can still take 2 ms of CPU to resolve each extent ref on a
-  fast machine on a large, heavily fragmented file.
+  though the bees workaround can still be triggered on newer kernels
+  by changes in btrfs since kernel version 5.1.

 * **dedupe breaks `btrfs send` in old kernels**.  The bees option
  `--workaround-btrfs-send` prevents any modification of read-only subvols
@@ -137,8 +155,6 @@ Workarounds for known kernel bugs
 Unfixed kernel bugs
 -------------------

-As of 5.18.18:
-
 * **The kernel does not permit `btrfs send` and dedupe to run at the
  same time**.  Recent kernels no longer crash, but now refuse one
  operation with an error if the other operation was already running.
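
The two workarounds for the `LOGICAL_INO` and dedupe hang described in the `docs/btrfs-kernel.md` hunks above can be illustrated with a minimal Python sketch. This is not bees code (bees is written in C++): `ExclusionSet`, `scan_extent`, `IOCTL_LOCK`, and the `logical_ino` and `dedupe` callables are hypothetical names introduced for illustration. The sketch also uses one plain mutex as the global lock, which is stricter than the documented rule because it serializes `LOGICAL_INO` calls against each other as well.

```python
import threading


class ExclusionSet:
    """Set of keys (inode or extent IDs) currently claimed by some worker."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._held = set()

    def try_acquire(self, key):
        """Claim key without blocking; return False if another worker has it."""
        with self._mutex:
            if key in self._held:
                return False
            self._held.add(key)
            return True

    def release(self, key):
        with self._mutex:
            self._held.discard(key)


# Workaround 2 (assumed shape): one process-wide lock around both ioctls,
# so LOGICAL_INO never runs while another thread is running dedupe.
IOCTL_LOCK = threading.Lock()


def scan_extent(extent_id, inode_id, extents, inodes, logical_ino, dedupe):
    """Return True if the extent was processed, False if it was deferred.

    Workaround 1: if another worker already holds the same extent or inode,
    do not wait for it; defer so the caller can pick other work.
    """
    if not extents.try_acquire(extent_id):
        return False
    try:
        if not inodes.try_acquire(inode_id):
            return False
        try:
            with IOCTL_LOCK:
                refs = logical_ino(extent_id)  # stand-in for the real ioctl
            with IOCTL_LOCK:
                dedupe(extent_id, refs)        # stand-in for the real ioctl
            return True
        finally:
            inodes.release(inode_id)
    finally:
        extents.release(extent_id)
```

A deferred call returns immediately instead of blocking, which matches the stated goal of having a worker find some other inode or extent to work on while btrfs finishes with the contended one.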
--- a/docs/btrfs-other.md
+++ b/docs/btrfs-other.md
@@ -8,44 +8,35 @@ bees has been tested in combination with the following:
 * HOLE extents and btrfs no-holes feature
 * Other deduplicators, reflink copies (though bees may decide to redo their work)
 * btrfs snapshots and non-snapshot subvols (RW and RO)
-* Concurrent file modification (e.g. PostgreSQL and sqlite databases, build daemons)
-* all btrfs RAID profiles
+* Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
+* All btrfs RAID profiles
 * IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
-* Filesystems mounted *with* the flushoncommit option ([lots of harmless kernel log warnings on 4.15 and later](btrfs-kernel.md))
-* Filesystems mounted *without* the flushoncommit option
+* Filesystems mounted with or without the `flushoncommit` option
 * 4K filesystem data block size / clone alignment
 * 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
-* Huge files (>1TB--although Btrfs performance on such files isn't great in general)
-* filesystems up to 30T+ bytes, 100M+ files
+* Large files (kernel 5.4 or later strongly recommended)
+* Filesystems up to 90T+ bytes, 1000M+ files
 * btrfs receive
 * btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
 * open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)
-* lvmcache:  no problems observed in testing with recent kernels or reported by users in the last year.
+* lvm dm-cache, writecache

 Bad Btrfs Feature Interactions
 ------------------------------

 bees has been tested in combination with the following, and various problems are known:

-* bcache:  no data-losing problems observed in testing with recent kernels
-  or reported by users in the last year.  Some issues observed with
-  bcache interacting badly with some SSD models' firmware, but so far
-  this only causes temporary loss of service, not filesystem damage.
-  This behavior does not seem to be specific to bees (ordinary filesystem
-  tests with rsync and snapshots will reproduce it), but it does prevent
-  any significant testing of bees on bcache.
-
-* btrfs send:  there are bugs in `btrfs send` that can be triggered by bees.
-  The [`--workaround-btrfs-send` option](options.md) works around this issue
-  by preventing bees from modifying read-only snapshots.
+* btrfs send:  there are bugs in `btrfs send` that can be triggered by
+  bees on old kernels.  The [`--workaround-btrfs-send` option](options.md)
+  works around this issue by preventing bees from modifying read-only
+  snapshots.

 * btrfs qgroups:  very slow, sometimes hangs...and it's even worse when
  bees is running.

-* btrfs autodefrag mount option:  hangs and high CPU usage problems
-  reported by users.  bees cannot distinguish autodefrag activity from
-  normal filesystem activity and will likely try to undo the autodefrag
-  if duplicate copies of the defragmented data exist.
+* btrfs autodefrag mount option:  bees cannot distinguish autodefrag
+  activity from normal filesystem activity, and may try to undo the
+  autodefrag if duplicate copies of the defragmented data exist.

 Untested Btrfs Feature Interactions
 -----------------------------------
@@ -54,9 +45,10 @@ bees has not been tested with the following, and undesirable interactions may oc

 * Non-4K filesystem data block size (should work if recompiled)
 * Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
-* btrfs seed filesystems (does anyone even use those?)
-* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe or encryption)
+* btrfs seed filesystems (no particular reason it wouldn't work, but no one has reported trying)
+* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe, encryption, extent tree v2)
 * btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
 * btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
-* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper.
 * Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
+* bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
+* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper
--- a/docs/config.md
+++ b/docs/config.md
@@ -8,9 +8,10 @@ are reasonable in most cases.
 Hash Table Sizing
 -----------------

-Hash table entries are 16 bytes per data block.  The hash table stores
-the most recently read unique hashes.  Once the hash table is full,
-each new entry in the table evicts an old entry.
+Hash table entries are 16 bytes per data block.  The hash table stores the
+most recently read unique hashes.  Once the hash table is full, each new
+entry added to the table evicts an old entry.  This makes the hash table
+a sliding window over the most recently scanned data from the filesystem.

 Here are some numbers to estimate appropriate hash table sizes:

@@ -25,9 +26,11 @@ Here are some numbers to estimate appropriate hash table sizes:
 Notes:

 * If the hash table is too large, no extra dedupe efficiency is
-obtained, and the extra space just wastes RAM.  Extra space can also slow
-bees down by preventing old data from being evicted, so bees wastes time
-looking for matching data that is no longer present on the filesystem.
+obtained, and the extra space wastes RAM.  If the hash table contains
+more block records than there are blocks in the filesystem, the extra
+space can slow bees down.  A table that is too large prevents obsolete
+data from being evicted, so bees wastes time looking for matching data
+that is no longer present on the filesystem.

 * If the hash table is too small, bees extrapolates from matching
 blocks to find matching adjacent blocks in the filesystem that have been
@@ -36,6 +39,10 @@ one block in common between two extents in order to be able to dedupe
 the entire extents.  This provides significantly more dedupe hit rate
 per hash table byte than other dedupe tools.

+ * There is a fairly wide range of usable hash sizes, and performance
+degrades according to a smooth probabilistic curve in both directions.
+Double or half the optimum size usually works just as well.
+
 * When counting unique data in compressed data blocks to estimate
 optimum hash table size, count the *uncompressed* size of the data.

@@ -66,11 +73,11 @@ data on an uncompressed filesystem.  Dedupe efficiency falls dramatically
 with hash tables smaller than 128MB/TB as the average dedupe extent size
 is larger than the largest possible compressed extent size (128KB).

-* **Short writes** also shorten the average extent length and increase
-optimum hash table size.  If a database writes to files randomly using
-4K page writes, all of these extents will be 4K in length, and the hash
-table size must be increased to retain each one (or the user must accept
-a lower dedupe hit rate).
+* **Short writes or fragmentation** also shorten the average extent
+length and increase optimum hash table size.  If a database writes to
+files randomly using 4K page writes, all of these extents will be 4K
+in length, and the hash table size must be increased to retain each one
+(or the user must accept a lower dedupe hit rate).

   Defragmenting files that have had many short writes increases the
 extent length and therefore reduces the optimum hash table size.
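
The sizing arithmetic in the `docs/config.md` hunks above can be written out as a short Python sketch. It assumes the table needs to retain roughly one 16-byte entry per average-sized dedupe extent of unique data; that assumption and the function name `hash_table_bytes` are illustrative, not taken from bees itself.

```python
HASH_ENTRY_BYTES = 16  # per the documentation: 16 bytes per hash table entry


def hash_table_bytes(unique_data_bytes, avg_extent_bytes):
    """Estimate hash table size, assuming one entry per average-sized extent."""
    entries = unique_data_bytes // avg_extent_bytes
    return entries * HASH_ENTRY_BYTES


TIB = 2**40
MIB = 2**20
KIB = 2**10

# 1 TiB of unique data at the largest compressed extent size (128 KiB)
print(hash_table_bytes(TIB, 128 * KIB) // MIB, "MiB")  # prints "128 MiB"
# The same data written as random 4 KiB pages needs 32 times more table
print(hash_table_bytes(TIB, 4 * KIB) // MIB, "MiB")    # prints "4096 MiB"
```

The first result lines up with the 128MB/TB figure quoted for 128KB extents, and the second shows why short writes or fragmentation push the optimum table size up.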
@@ -94,38 +101,75 @@ every time a new client machine's data is added to the server.
 Scanning modes for multiple subvols
 -----------------------------------

-The `--scan-mode` option affects how bees divides resources between
-subvolumes.  This is particularly relevant when there are snapshots,
-as there are tradeoffs to be made depending on how snapshots are used
-on the filesystem.
+The `--scan-mode` option affects how bees schedules worker threads
+between subvolumes.  Scan modes are an experimental feature and will
+likely be deprecated in favor of a better solution.

-Note that if a filesystem has only one subvolume (i.e. the root,
-subvol ID 5) then the `--scan-mode` option has no effect, as there is
-only one subvolume to scan.
+Scan mode can be changed at any time by restarting bees with a different
+mode option.  Scan state tracking is the same for all of the currently
+implemented modes.  The difference between the modes is the order in
+which subvols are selected.

-The default mode is mode 0, "lockstep".  In this mode, each inode of each
-subvol is scanned at the same time, before moving to the next inode in
-each subvol.  This maximizes the likelihood that all of the references to
-a snapshot of a file are scanned at the same time, which takes advantage
-of VFS caching in the Linux kernel.  If snapshots are created very often,
-bees will not make very good progress as it constantly restarts the
-filesystem scan from the beginning each time a new snapshot is created.
+If a filesystem has only one subvolume with data in it, then the
+`--scan-mode` option has no effect.  In this case, there is only one
+subvolume to scan, so worker threads will all scan that one.

-Scan mode 1, "independent", simply scans every subvol independently
-in parallel.  Each subvol's scanner shares time equally with all other
-subvol scanners.  Whenever a new subvol appears, a new scanner is
-created and the new subvol scanner doesn't affect the behavior of any
-existing subvol scanner.
+Within a subvol, there is a single optimal scan order:  files are scanned
+in ascending numerical inode order.  Each worker will scan a different
+inode to avoid having the threads contend with each other for locks.
+File data is read sequentially and in order, but old blocks from earlier
+scans are skipped.

-Scan mode 2, "sequential", processes each subvol completely before
-proceeding to the next subvol.  This is a good mode when using bees for
-the first time on a filesystem that already has many existing snapshots
-and a high rate of new snapshot creation.  Short-lived snapshots
-(e.g. those used for `btrfs send`) are effectively ignored, and bees
-directs its efforts toward older subvols that are more likely to be
-origin subvols for snapshots.  By deduping origin subvols first, bees
-ensures that future snapshots will already be deduplicated and do not
-need to be deduplicated again.
+Between subvols, there are several scheduling algorithms with different
+trade-offs:
+
+Scan mode 0, "lockstep", scans the same inode number in each subvol at
+close to the same time.  This is useful if the subvols are snapshots
+with a common ancestor, since the same inode number in each subvol will
+have similar or identical contents.  This maximizes the likelihood
+that all of the references to a snapshot of a file are scanned at
+close to the same time, improving dedupe hit rate and possibly taking
+advantage of VFS caching in the Linux kernel.  If the subvols are
+unrelated (i.e. not snapshots of a single subvol) then this mode does
+not provide significant benefit over random selection.  This mode uses
+smaller amounts of temporary space for shorter periods of time when most
+subvols are snapshots.  When a new snapshot is created, this mode will
+stop scanning other subvols and scan the new snapshot until the same
+inode number is reached in each subvol, which will effectively stop
+dedupe temporarily as this data has already been scanned and deduped
+in the other snapshots.
+
+Scan mode 1, "independent", scans the next inode with new data in each
+subvol.  Each subvol's scanner shares inodes uniformly with all other
+subvol scanners until the subvol has no new inodes left.  This mode makes
+continuous forward progress across the filesystem and provides average
+performance across a variety of workloads, but is slow to respond to new
+data, and may spend a lot of time deduping short-lived subvols that will
+soon be deleted when it is preferable to dedupe long-lived subvols that
+will be the origin of future snapshots.  When a new snapshot is created,
+previous subvol scans continue as before, but the time is now divided
+among one more subvol.
+
+Scan mode 2, "sequential", scans one subvol at a time, in numerical subvol
+ID order, processing each subvol completely before proceeding to the
+next subvol.  This avoids spending time scanning short-lived snapshots
+that will be deleted before they can be fully deduped (e.g. those used
+for `btrfs send`).  Scanning is concentrated on older subvols that are
+more likely to be origin subvols for future snapshots, eliminating the
+need to dedupe future snapshots separately.  This mode uses the largest
+amount of temporary space for the longest time, and typically requires
+a larger hash table to maintain dedupe hit rate.
+
+Scan mode 3, "recent", scans the subvols with the highest `min_transid`
+value first (i.e. the ones that were most recently completely scanned),
+then falls back to "independent" mode to break ties.  This interrupts
+long scans of old subvols to give a rapid dedupe response to new data,
+then returns to the old subvols after the new data is scanned.  It is
+useful for large filesystems with multiple active subvols and rotating
+snapshots, where the first-pass scan can take months, but new duplicate
+data appears every day.
+
+The default scan mode is 1, "independent".

 If you are using bees for the first time on a filesystem with many
 existing snapshots, you should read about [snapshot gotchas](gotchas.md).
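
The four scan modes described in the `docs/config.md` hunk above differ only in which subvol is selected next, so the differences can be sketched as four sort keys. This is a simplified model in Python, not the bees scheduler: the `Crawler` fields, the `last_served` tick used to approximate fair sharing in "independent" mode, and the function `pick_next` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Crawler:
    subvol_id: int    # btrfs subvol ID
    next_inode: int   # next inode number this subvol's scan will visit
    min_transid: int  # lower bound of the transid range being scanned
    last_served: int  # hypothetical tick of the last time this subvol was picked


def pick_next(crawlers, mode):
    """Pick the subvol crawler to serve next under the given scan mode."""
    if mode == 0:    # lockstep: keep every subvol at about the same inode number
        key = lambda c: (c.next_inode, c.subvol_id)
    elif mode == 1:  # independent: share time evenly, least recently served first
        key = lambda c: (c.last_served, c.subvol_id)
    elif mode == 2:  # sequential: finish the lowest subvol ID before moving on
        key = lambda c: c.subvol_id
    elif mode == 3:  # recent: highest min_transid first, ties broken as in mode 1
        key = lambda c: (-c.min_transid, c.last_served, c.subvol_id)
    else:
        raise ValueError(f"unknown scan mode {mode}")
    return min(crawlers, key=key)


crawlers = [
    Crawler(subvol_id=256, next_inode=500, min_transid=10, last_served=2),
    Crawler(subvol_id=257, next_inode=100, min_transid=42, last_served=3),
    Crawler(subvol_id=258, next_inode=900, min_transid=42, last_served=1),
    Crawler(subvol_id=259, next_inode=700, min_transid=50, last_served=4),
]
for mode in range(4):
    print(mode, pick_next(crawlers, mode).subvol_id)
```

Because scan state tracking is the same for all modes, switching modes in this model only changes the key function; the `Crawler` records themselves do not need to be rebuilt.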
--- a/docs/event-counters.md
+++ b/docs/event-counters.md
@@ -118,6 +118,7 @@ crawl

 The `crawl` event group consists of operations related to scanning btrfs trees to find new extent refs to scan for dedupe.

+ * `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
 * `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
 * `crawl_create`: A new subvol crawler was created.
 * `crawl_done`: One pass over all subvols on the filesystem was completed.
@@ -133,7 +134,6 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
 * `crawl_nondata`: An item in the search results is not data.
 * `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
 * `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
- * `crawl_restart`: A subvol crawl was restarted with a new `min_transid..max_transid` range.
 * `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
 * `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
 * `crawl_unknown`: An extent item in the search results has an unrecognized type.
@@ -296,6 +296,7 @@ resolve

 The `resolve` event group consists of operations related to translating a btrfs virtual block address (i.e. physical block address) to a `(root, inode, offset)` tuple (i.e. locating and opening the file containing a matching block).  `resolve` is the top level, `chase` and `adjust` are the lower two levels.

+ * `resolve_empty`: The `LOGICAL_INO` ioctl returned successfully with an empty reference list (0 items).
 * `resolve_fail`: The `LOGICAL_INO` ioctl returned an error.
 * `resolve_large`: The `LOGICAL_INO` ioctl returned more than 2730 results (the limit of the v1 ioctl).
 * `resolve_ms`: Total time spent in the `LOGICAL_INO` ioctl (i.e. wallclock time, not kernel CPU time).
@@ -363,6 +364,8 @@ scanf

 The `scanf` event group consists of operations related to `BeesContext::scan_forward`.  This is the entry point where `crawl` schedules new data for scanning.

+ * `scanf_deferred_extent`: Two tasks attempted to scan the same extent at the same time, so one was deferred.
+ * `scanf_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
 * `scanf_extent`: A btrfs extent item was scanned.
 * `scanf_extent_ms`: Total thread-seconds spent scanning btrfs extent items.
 * `scanf_total`: A logical byte range of a file was scanned.
--- a/docs/gotchas.md
+++ b/docs/gotchas.md
@@ -51,81 +51,40 @@ loops early.  The exception text in this case is:
 Terminating bees with SIGTERM
 -----------------------------

-bees is designed to survive host crashes, so it is safe to terminate
-bees using SIGKILL; however, when bees next starts up, it will repeat
-some work that was performed between the last bees crawl state save point
-and the SIGKILL (up to 15 minutes).  If bees is stopped and started less
-than once per day, then this is not a problem as the proportional impact
-is quite small; however, users who stop and start bees daily or even
-more often may prefer to have a clean shutdown with SIGTERM so bees can
-restart faster.
+bees is designed to survive host crashes, so it is safe to terminate bees
+using SIGKILL; however, when bees next starts up, it will repeat some
+work that was performed between the last bees crawl state save point
+and the SIGKILL (up to 15 minutes), and a large hash table may not be
+completely written back to disk, so some duplicate matches will be lost.

-bees handling of SIGTERM can take a long time on machines with some or
-all of:
+If bees is stopped and started less than once per week, then this is not
+a problem as the proportional impact is quite small; however, users who
+stop and start bees daily or even more often may prefer to have a clean
+shutdown with SIGTERM so bees can restart faster.

-   * Large RAM and `vm.dirty_ratio`
-   * Large number of active bees worker threads
-   * Large number of bees temporary files (proportional to thread count)
-   * Large hash table size
-   * Large filesystem size
-   * High IO latency, especially "low power" spinning disks
-   * High filesystem activity, especially duplicate data writes
+The shutdown procedure performs these steps:

-Each of these factors individually increases the total time required
-to perform a clean bees shutdown.  When combined, the factors can
-multiply with each other, dramatically increasing the time required to
-flush bees state to disk.
-
-On a large system with many of the above factors present, a "clean"
-bees shutdown can take more than 20 minutes.  Even a small machine
-(16GB RAM, 1GB hash table, 1TB NVME disk) can take several seconds to
-complete a SIGTERM shutdown.
-
-The shutdown procedure performs potentially long-running tasks in
-this order:
-
-   1.  Worker threads finish executing their current Task and exit.
-       Threads executing `LOGICAL_INO` ioctl calls usually finish quickly,
-       but btrfs imposes no limit on the ioctl's running time, so it
-       can take several minutes in rare bad cases.  If there is a btrfs
-       commit already in progress on the filesystem, then most worker
-       threads will be blocked until the btrfs commit is finished.
-
-   2.  Crawl state is saved to `$BEESHOME`.  This normally completes
-       relatively quickly (a few seconds at most).  This is the most
+   1.  Crawl state is saved to `$BEESHOME`.  This is the most
       important bees state to save to disk as it directly impacts
-       restart time, so it is done as early as possible (but no earlier).
+       restart time, so it is done as early as possible.

-   3.  Hash table is written to disk.  Normally the hash table is
-       trickled back to disk at a rate of about 2GB per hour;
+   2.  Hash table is written to disk.  Normally the hash table is
+       trickled back to disk at a rate of about 128KiB per second;
       however, SIGTERM causes bees to attempt to flush the whole table
-       immediately.  If bees has recently been idle then the hash table is
-       likely already flushed to disk, so this step will finish quickly;
-       however, if bees has recently been active and the hash table is
-       large relative to RAM size, the blast of rapidly written data
-       can force the Linux VFS to block all writes to the filesystem
-       for sufficient time to complete all pending btrfs metadata
-       writes which accumulated during the btrfs commit before bees
-       received SIGTERM...and _then_ let bees write out the hash table.
-       The time spent here depends on the size of RAM, speed of disks,
-       and aggressiveness of competing filesystem workloads.
+       immediately.  The time spent here depends on the size of RAM, speed
+       of disks, and aggressiveness of competing filesystem workloads.
+       It can trigger `vm.dirty_bytes` limits and block other processes
+       writing to the filesystem for a while.

-   4.  bees temporary files are closed, which implies deletion of their
-       inodes.  These are files which consist entirely of shared extent
-       structures, and btrfs takes an unusually long time to delete such
-       files (up to a few minutes for each on slow spinning disks).
+   3.  The bees process calls `_exit`, which terminates all running
+       worker threads and closes and deletes all temporary files.  This
+       can take a while _after_ the bees process exits, especially on
+       slow spinning disks.

-If bees is terminated with SIGKILL, only step #1 and #4 are performed (the
-kernel performs these automatically if bees exits).  This reduces the
-shutdown time at the cost of increased startup time.

 Balances
 --------

-First, read [`LOGICAL_INO` and btrfs balance WARNING](btrfs-kernel.md).
-bees will suspend operations during a btrfs balance to work around
-kernel bugs.
-
 A btrfs balance relocates data on disk by making a new copy of the
 data, replacing all references to the old data with references to the
 new copy, and deleting the old copy.  To bees, this is the same as any
@@ -175,7 +134,9 @@ the beginning.

 Each time bees dedupes an extent that is referenced by a snapshot,
 the entire metadata page in the snapshot subvol (16KB by default) must
-be CoWed in btrfs.  This can result in a substantial increase in btrfs
+be CoWed in btrfs.  Since all references must be removed at the same
+time, this CoW operation is repeated in every snapshot containing the
+duplicate data.  This can result in a substantial increase in btrfs
 metadata size if there are many snapshots on a filesystem.

 Normally, metadata is small (less than 1% of the filesystem) and dedupe
@@ -252,17 +213,18 @@ Other Gotchas
  filesystem while `LOGICAL_INO` is running.  Generally the CPU spends
  most of the runtime of the `LOGICAL_INO` ioctl running the kernel,
  so on a single-core CPU the entire system can freeze up for a second
-  during operations on toxic extents.
+  during operations on toxic extents.  Note this only occurs on older
+  kernels.  See [the slow backrefs kernel bug section](btrfs-kernel.md).

 * If a process holds a directory FD open, the subvol containing the
  directory cannot be deleted (`btrfs sub del` will start the deletion
  process, but it will not proceed past the first open directory FD).
  `btrfs-cleaner` will simply skip over the directory *and all of its
  children* until the FD is closed.  bees avoids this gotcha by closing
-  all of the FDs in its directory FD cache every 10 btrfs transactions.
+  all of the FDs in its directory FD cache every btrfs transaction.

 * If a file is deleted while bees is caching an open FD to the file,
  bees continues to scan the file.  For very large files (e.g. VM
  images), the deletion of the file can be delayed indefinitely.
  To limit this delay, bees closes all FDs in its file FD cache every
-  10 btrfs transactions.
+  btrfs transaction.
--- a/docs/how-it-works.md
+++ b/docs/how-it-works.md
@@ -8,10 +8,12 @@ bees uses checkpoints for persistence to eliminate the IO overhead of a
 transactional data store.  On restart, bees will dedupe any data that
 was added to the filesystem since the last checkpoint.  Checkpoints
 occur every 15 minutes for scan progress, stored in `beescrawl.dat`.
-The hash table trickle-writes to disk at 4GB/hour to `beeshash.dat`.
-An hourly performance report is written to `beesstats.txt`.  There are
-no special requirements for bees hash table storage--`.beeshome` could
-be stored on a different btrfs filesystem, ext4, or even CIFS.
+The hash table trickle-writes to disk at 128KiB/s to `beeshash.dat`,
+but will flush immediately if bees is terminated by SIGTERM.
+
+There are no special requirements for bees hash table storage--`.beeshome`
+could be stored on a different btrfs filesystem, ext4, or even CIFS (but
+not MS-DOS--beeshome does need filenames longer than 8.3).

 bees uses a persistent dedupe hash table with a fixed size configured
 by the user.  Any size of hash table can be dedicated to dedupe.  If a
@@ -20,7 +22,7 @@ small as 128KB.

 The bees hash table is loaded into RAM at startup and `mlock`ed so it
 will not be swapped out by the kernel (if swap is permitted, performance
-degrades to nearly zero).
+degrades to nearly zero, for both bees and the swap device).

 bees scans the filesystem in a single pass which removes duplicate
 extents immediately after they are detected.  There are no distinct
@@ -83,12 +85,12 @@ of these functions in userspace, at the expense of encountering [some
 kernel bugs in `LOGICAL_INO` performance](btrfs-kernel.md).

 bees uses only the data-safe `FILE_EXTENT_SAME` (aka `FIDEDUPERANGE`)
-kernel operations to manipulate user data, so it can dedupe live data
-(e.g. build servers, sqlite databases, VM disk images).  It does not
-modify file attributes or timestamps.
+kernel ioctl to manipulate user data, so it can dedupe live data
+(e.g. build servers, sqlite databases, VM disk images).  bees does not
+modify file attributes or timestamps in deduplicated files.

-When bees has scanned all of the data, bees will pause until 10
-transactions have been completed in the btrfs filesystem.  bees tracks
+When bees has scanned all of the data, bees will pause until a new
+transaction has completed in the btrfs filesystem.  bees tracks
 the current btrfs transaction ID over time so that it polls less often
 on quiescent filesystems and more often on busy filesystems.

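As a rough sanity check of the 128KiB/s trickle-write rate quoted in the how-it-works.md hunk above, the sketch below computes how long one full pass over a hash table would take at that rate. The helper name `trickle_seconds` and the 1 GiB table size are illustrative assumptions and are not part of bees.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of bees): seconds needed to write
// `table_bytes` at a fixed rate of `rate_bytes_per_sec`, rounded up
// to the next whole second.
constexpr uint64_t trickle_seconds(uint64_t table_bytes, uint64_t rate_bytes_per_sec)
{
	return (table_bytes + rate_bytes_per_sec - 1) / rate_bytes_per_sec;
}

// Rate quoted in the documentation above: 128 KiB per second.
constexpr uint64_t TRICKLE_RATE = uint64_t(128) * 1024;

// Example size only: a 1 GiB hash table.
constexpr uint64_t ONE_GIB = uint64_t(1) << 30;

// 2^30 bytes / 2^17 bytes per second = 2^13 seconds.
static_assert(trickle_seconds(ONE_GIB, TRICKLE_RATE) == 8192, "1 GiB at 128 KiB/s");
```

Under these assumptions a single trickle pass over 1 GiB takes 8192 seconds, a little over two hours, which gives a sense of how much unwritten state an immediate flush on SIGTERM may have to cover.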
--- a/docs/index.md
+++ b/docs/index.md
@@ -17,7 +17,6 @@ Strengths
 * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
 * Daemon incrementally dedupes new data using btrfs tree search
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * **NEW** [Works around `btrfs send` problems with dedupe and incremental parent snapshots](options.md)
 * Works around btrfs filesystem structure to free more disk space
 * Persistent hash table for rapid restart after shutdown
 * Whole-filesystem dedupe - including snapshots
@@ -70,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2022 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
--- a/docs/install.md
+++ b/docs/install.md
@@ -4,7 +4,7 @@ Building bees
 Dependencies
 ------------

-* C++11 compiler (tested with GCC 4.9, 6.3.0, 8.1.0)
+* C++11 compiler (tested with GCC 8.1.0, 12.2.0)

  Sorry.  I really like closures and shared_ptr, so support
  for earlier compiler versions is unlikely.
@@ -19,7 +19,7 @@ Dependencies

 * [Linux kernel version](btrfs-kernel.md) gets its own page.

-* markdown for documentation
+* markdown to build the documentation

 * util-linux version that provides `blkid` command for the helper
  script `scripts/beesd` to work
@@ -80,7 +80,7 @@ within a temporary runtime directory.
 Packaging
 ---------

-See 'Dependencies' below. Package maintainers can pick ideas for building and
+See 'Dependencies' above. Package maintainers can pick ideas for building and
 configuring the source package from the Gentoo ebuild:

 <https://github.com/gentoo/gentoo/tree/master/sys-fs/bees>
--- a/docs/missing.md
+++ b/docs/missing.md
@@ -2,8 +2,8 @@ Features You Might Expect That bees Doesn't Have
 ------------------------------------------------

 * There's no configuration file (patches welcome!).  There are
-some tunables hardcoded in the source that could eventually become
-configuration options.  There's also an incomplete option parser
+some tunables hardcoded in the source (`src/bees.h`) that could eventually
+become configuration options.  There's also an incomplete option parser
 (patches welcome!).

 * The bees process doesn't fork and writes its log to stdout/stderr.
@@ -43,3 +43,6 @@ compression method or not compress the data (patches welcome!).
 * It is theoretically possible to resize the hash table without starting
 over with a new full-filesystem scan; however, this feature has not been
 implemented yet.
+
+* btrfs maintains csums of data blocks which bees could use to improve
+scan speeds, but bees doesn't use them yet.
--- a/docs/options.md
+++ b/docs/options.md
@@ -40,16 +40,16 @@

 * `--scan-mode MODE` or `-m`

- Specify extent scanning algorithm.  Default `MODE` is 0.
+ Specify extent scanning algorithm.
 **EXPERIMENTAL** feature that may go away.

-  * Mode 0: scan extents in ascending order of (inode, subvol, offset).
-  Keeps shared extents between snapshots together.  Reads files sequentially.
-  Minimizes temporary space usage.
-  * Mode 1: scan extents from all subvols in parallel.  Good performance
-  on non-spinning media when subvols are unrelated.
-  * Mode 2: scan all extents from one subvol at a time.  Good sequential
-  read performance for spinning media.  Maximizes temporary space usage.
+  * Mode 0: lockstep
+  * Mode 1: independent
+  * Mode 2: sequential
+  * Mode 3: recent
+
+ For details of the different scanning modes and the default value of
+ this option, see [bees configuration](config.md).

 ## Workarounds

--- a/include/crucible/btrfs-tree.h
+++ b/include/crucible/btrfs-tree.h
@@ -0,0 +1,204 @@
+#ifndef CRUCIBLE_BTRFS_TREE_H
+#define CRUCIBLE_BTRFS_TREE_H
+
+#include "crucible/fd.h"
+#include "crucible/fs.h"
+#include "crucible/bytevector.h"
+
+namespace crucible {
+	using namespace std;
+
+	class BtrfsTreeItem {
+		uint64_t m_objectid = 0;
+		uint64_t m_offset = 0;
+		uint64_t m_transid = 0;
+		ByteVector m_data;
+		uint8_t m_type = 0;
+	public:
+		uint64_t objectid() const { return m_objectid; }
+		uint64_t offset() const { return m_offset; }
+		uint64_t transid() const { return m_transid; }
+		uint8_t type() const { return m_type; }
+		const ByteVector data() const { return m_data; }
+		BtrfsTreeItem() = default;
+		BtrfsTreeItem(const BtrfsIoctlSearchHeader &bish);
+		BtrfsTreeItem& operator=(const BtrfsIoctlSearchHeader &bish);
+		bool operator!() const;
+
+		/// Member access methods.  Invoking a method on the
+		/// wrong type of item will throw an exception.
+
+		/// @{ Block group items
+		uint64_t block_group_flags() const;
+		uint64_t block_group_used() const;
+		/// @}
+
+		/// @{ Chunk items
+		uint64_t chunk_length() const;
+		uint64_t chunk_type() const;
+		/// @}
+
+		/// @{ Dev extent items (physical byte ranges)
+		uint64_t dev_extent_chunk_offset() const;
+		uint64_t dev_extent_length() const;
+		/// @}
+
+		/// @{ Dev items (devices)
+		uint64_t dev_item_total_bytes() const;
+		uint64_t dev_item_bytes_used() const;
+		/// @}
+
+		/// @{ Inode items
+		uint64_t inode_size() const;
+		/// @}
+
+		/// @{ Extent refs (EXTENT_DATA)
+		uint64_t file_extent_logical_bytes() const;
+		uint64_t file_extent_generation() const;
+		uint64_t file_extent_offset() const;
+		uint64_t file_extent_bytenr() const;
+		uint8_t file_extent_type() const;
+		btrfs_compression_type file_extent_compression() const;
+		/// @}
+
+		/// @{ Extent items (EXTENT_ITEM)
+		uint64_t extent_begin() const;
+		uint64_t extent_end() const;
+		uint64_t extent_generation() const;
+		/// @}
+
+		/// @{ Root items
+		uint64_t root_flags() const;
+		/// @}
+
+		/// @{ Root backref items.
+		uint64_t root_ref_dirid() const;
+		string root_ref_name() const;
+		uint64_t root_ref_parent_rootid() const;
+		/// @}
+	};
+
+	ostream &operator<<(ostream &os, const BtrfsTreeItem &bti);
+
+	class BtrfsTreeFetcher {
+	protected:
+		Fd m_fd;
+		BtrfsIoctlSearchKey m_sk;
+		uint64_t m_tree = 0;
+		uint64_t m_min_transid = 0;
+		uint64_t m_max_transid = numeric_limits<uint64_t>::max();
+		uint64_t m_block_size = 0;
+		uint64_t m_lookbehind_size = 0;
+		uint64_t m_scale_size = 0;
+		uint8_t m_type = 0;
+
+		uint64_t scale_logical(uint64_t logical) const;
+		uint64_t unscale_logical(uint64_t logical) const;
+		const static uint64_t s_max_logical = numeric_limits<uint64_t>::max();
+		uint64_t scaled_max_logical() const;
+
+		virtual void fill_sk(BtrfsIoctlSearchKey &key, uint64_t object);
+		virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr);
+		virtual uint64_t hdr_logical(const BtrfsIoctlSearchHeader &hdr) = 0;
+		virtual bool hdr_match(const BtrfsIoctlSearchHeader &hdr) = 0;
+		virtual bool hdr_stop(const BtrfsIoctlSearchHeader &hdr) = 0;
+		Fd fd() const;
+		void fd(Fd fd);
+	public:
+		virtual ~BtrfsTreeFetcher() = default;
+		BtrfsTreeFetcher(Fd new_fd);
+		void type(uint8_t type);
+		void tree(uint64_t tree);
+		void transid(uint64_t min_transid, uint64_t max_transid = numeric_limits<uint64_t>::max());
+		/// Block size (sectorsize) of filesystem
+		uint64_t block_size() const;
+		/// Fetch last object < logical, null if not found
+		BtrfsTreeItem prev(uint64_t logical);
+		/// Fetch first object > logical, null if not found
+		BtrfsTreeItem next(uint64_t logical);
+		/// Fetch object at exactly logical, null if not found
+		BtrfsTreeItem at(uint64_t);
+		/// Fetch first object >= logical
+		BtrfsTreeItem lower_bound(uint64_t logical);
+		/// Fetch last object <= logical
+		BtrfsTreeItem rlower_bound(uint64_t logical);
+
+		/// Estimated distance between objects
+		virtual uint64_t lookbehind_size() const;
+		virtual void lookbehind_size(uint64_t);
+
+		/// Scale size (normally block size but must be set to 1 for fs trees)
+		uint64_t scale_size() const;
+		void scale_size(uint64_t);
+	};
+
+	class BtrfsTreeObjectFetcher : public BtrfsTreeFetcher {
+	protected:
+		virtual void fill_sk(BtrfsIoctlSearchKey &key, uint64_t logical) override;
+		virtual uint64_t hdr_logical(const BtrfsIoctlSearchHeader &hdr) override;
+		virtual bool hdr_match(const BtrfsIoctlSearchHeader &hdr) override;
+		virtual bool hdr_stop(const BtrfsIoctlSearchHeader &hdr) override;
+	public:
+		using BtrfsTreeFetcher::BtrfsTreeFetcher;
+	};
+
+	class BtrfsTreeOffsetFetcher : public BtrfsTreeFetcher {
+	protected:
+		uint64_t m_objectid = 0;
+		virtual void fill_sk(BtrfsIoctlSearchKey &key, uint64_t offset) override;
+		virtual uint64_t hdr_logical(const BtrfsIoctlSearchHeader &hdr) override;
+		virtual bool hdr_match(const BtrfsIoctlSearchHeader &hdr) override;
+		virtual bool hdr_stop(const BtrfsIoctlSearchHeader &hdr) override;
+	public:
+		using BtrfsTreeFetcher::BtrfsTreeFetcher;
+		void objectid(uint64_t objectid);
+		uint64_t objectid() const;
+	};
+
+	class BtrfsCsumTreeFetcher : public BtrfsTreeOffsetFetcher {
+	public:
+		const uint32_t BTRFS_CSUM_TYPE_UNKNOWN = uint32_t(1) << 16;
+	private:
+		size_t		m_sum_size = 0;
+		uint32_t	m_sum_type = BTRFS_CSUM_TYPE_UNKNOWN;
+	public:
+		BtrfsCsumTreeFetcher(const Fd &fd);
+
+		uint32_t sum_type() const;
+		size_t sum_size() const;
+		void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
+	};
+
+	/// Fetch extent items from extent tree
+	class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
+	public:
+		BtrfsExtentItemFetcher(const Fd &fd);
+	};
+
+	/// Fetch extent refs from an inode
+	class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
+	public:
+		BtrfsExtentDataFetcher(const Fd &fd);
+	};
+
+	/// Fetch inodes from a subvol
+	class BtrfsFsTreeFetcher : public BtrfsTreeObjectFetcher {
+	public:
+		BtrfsFsTreeFetcher(const Fd &fd, uint64_t subvol);
+	};
+
+	class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
+	public:
+		BtrfsInodeFetcher(const Fd &fd);
+		BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
+	};
+
+	class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
+	public:
+		BtrfsRootFetcher(const Fd &fd);
+		BtrfsTreeItem root(uint64_t subvol);
+	};
+
+}
+
+#endif
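The five lookup methods declared on `BtrfsTreeFetcher` above differ only in which comparison they apply to `logical`. The sketch below illustrates those comparisons on a plain `std::map`; it does not use crucible or any btrfs ioctl, the `sketch_*` names are hypothetical, and a return value of 0 stands in for the "null" item.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Illustration only: a std::map stands in for a btrfs tree keyed by
// logical address.  A value of 0 plays the role of the "null" item.
using Tree = std::map<uint64_t, int>;

// Last item with key < logical, or 0 if none.
int sketch_prev(const Tree &t, uint64_t logical)
{
	auto it = t.lower_bound(logical);	// first key >= logical
	return it == t.begin() ? 0 : std::prev(it)->second;
}

// First item with key > logical, or 0 if none.
int sketch_next(const Tree &t, uint64_t logical)
{
	auto it = t.upper_bound(logical);	// first key > logical
	return it == t.end() ? 0 : it->second;
}

// Item with key == logical, or 0 if none.
int sketch_at(const Tree &t, uint64_t logical)
{
	auto it = t.find(logical);
	return it == t.end() ? 0 : it->second;
}

// First item with key >= logical, or 0 if none.
int sketch_lower_bound(const Tree &t, uint64_t logical)
{
	auto it = t.lower_bound(logical);
	return it == t.end() ? 0 : it->second;
}

// Last item with key <= logical, or 0 if none.
int sketch_rlower_bound(const Tree &t, uint64_t logical)
{
	auto it = t.upper_bound(logical);	// first key > logical
	return it == t.begin() ? 0 : std::prev(it)->second;
}
```

For a tree holding keys 10, 20 and 30, a query at 25 finds 30 with `sketch_lower_bound` and 20 with `sketch_rlower_bound`, while `sketch_at` finds nothing.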
--- a/include/crucible/bytevector.h
+++ b/include/crucible/bytevector.h
@@ -1,7 +1,11 @@
 #ifndef _CRUCIBLE_BYTEVECTOR_H_
 #define _CRUCIBLE_BYTEVECTOR_H_

+#include <crucible/error.h>
+
 #include <memory>
+#include <mutex>
+#include <ostream>

 #include <cstdint>
 #include <cstdlib>
@@ -20,6 +24,8 @@ namespace crucible {
 		using iterator = value_type*;

 		ByteVector() = default;
+		ByteVector(const ByteVector &that);
+		ByteVector& operator=(const ByteVector &that);
 		ByteVector(size_t size);
 		ByteVector(const ByteVector &that, size_t start, size_t length);
 		ByteVector(iterator begin, iterator end, size_t min_size = 0);
@@ -48,6 +54,8 @@ namespace crucible {
 	private:
 		Pointer m_ptr;
 		size_t m_size = 0;
+		mutable mutex m_mutex;
+	friend ostream & operator<<(ostream &os, const ByteVector &bv);
 	};

 	template <class T>
@@ -63,9 +71,9 @@ namespace crucible {
 	T*
 	ByteVector::get() const
 	{
+		THROW_CHECK2(out_of_range, size(), sizeof(T), size() >= sizeof(T));
 		return reinterpret_cast<T*>(data());
 	}
-
 }

 #endif // _CRUCIBLE_BYTEVECTOR_H_
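The `THROW_CHECK2` line added to `ByteVector::get()` above rejects a buffer that is too small to hold a `T`. A minimal standalone sketch of the same guard follows. It assumes a `std::vector<uint8_t>` in place of `ByteVector`, the name `checked_get` is hypothetical, and it copies the bytes out with `memcpy` instead of returning a `reinterpret_cast` pointer as the original does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for ByteVector::get<T>(): refuse to interpret
// a buffer as a T unless the buffer holds at least sizeof(T) bytes.
template <class T>
T checked_get(const std::vector<uint8_t> &buf)
{
	if (buf.size() < sizeof(T)) {
		throw std::out_of_range("buffer smaller than requested type");
	}
	// Copy rather than cast, so alignment of buf.data() does not matter.
	T rv;
	std::memcpy(&rv, buf.data(), sizeof(T));
	return rv;
}
```

Reading a `uint32_t` from a 2-byte buffer throws `std::out_of_range` here instead of reading past the end of the allocation.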
--- a/include/crucible/cache.h
+++ b/include/crucible/cache.h
@@ -30,7 +30,7 @@ namespace crucible {
 		map<Key, ListIter>	m_map;
 		LockSet<Key>		m_lockset;
 		size_t			m_max_size;
-		mutex			m_mutex;
+		mutable mutex		m_mutex;

 		void check_overflow();
 		void recent_use(ListIter vp);
@@ -48,6 +48,7 @@ namespace crucible {
 		void expire(Arguments... args);
 		void insert(const Return &r, Arguments... args);
 		void clear();
+		size_t size() const;
 	};

 	template <class Return, class... Arguments>
@@ -190,6 +191,14 @@ namespace crucible {
 		lock.unlock();
 	}

+	template <class Return, class... Arguments>
+	size_t
+	LRUCache<Return, Arguments...>::size() const
+	{
+		unique_lock<mutex> lock(m_mutex);
+		return m_map.size();
+	}
+
 	template<class Return, class... Arguments>
 	Return
 	LRUCache<Return, Arguments...>::operator()(Arguments... args)
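The `size() const` accessor added in the cache.h hunks above has to lock `m_mutex`, which is why the member became `mutable`. The sketch below shows the same pattern in isolation; `TinyCache` is a hypothetical miniature class, not the crucible `LRUCache`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical miniature cache, not the crucible LRUCache.
class TinyCache {
	std::map<std::string, int> m_map;
	// mutable: locking modifies the mutex but not the logical state
	// of the cache, so const methods are allowed to lock it.
	mutable std::mutex m_mutex;
public:
	void insert(const std::string &key, int value)
	{
		std::unique_lock<std::mutex> lock(m_mutex);
		m_map[key] = value;
	}

	std::size_t size() const
	{
		std::unique_lock<std::mutex> lock(m_mutex);
		return m_map.size();
	}
};
```

Without `mutable`, the `unique_lock` constructor in `size() const` would not compile, because it needs a non-const reference to the mutex.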
--- a/include/crucible/fd.h
+++ b/include/crucible/fd.h
@@ -27,9 +27,9 @@
 namespace crucible {
 	using namespace std;

-	// IOHandle is a file descriptor owner object.  It closes them when destroyed.
-	// Most of the functions here don't use it because these functions don't own FDs.
-	// All good names for such objects are taken.
+	/// File descriptor owner object.  It closes the file descriptor when destroyed.
+	/// Most of the functions here don't use it because these functions don't own FDs.
+	/// All good names for such objects are taken.
 	class IOHandle {
 		IOHandle(const IOHandle &) = delete;
 		IOHandle(IOHandle &&) = delete;
@@ -43,6 +43,7 @@ namespace crucible {
 		int get_fd() const;
 	};

+	/// Copyable file descriptor.
 	class Fd {
 		static NamedPtr<IOHandle, int> s_named_ptr;
 		shared_ptr<IOHandle> m_handle;
@@ -62,24 +63,29 @@ namespace crucible {

 	// Functions named "foo_or_die" throw exceptions on failure.

-	// Attempt to open the file with the given mode
+	/// Attempt to open the file with the given mode, throw exception on failure.
 	int open_or_die(const string &file, int flags = O_RDONLY, mode_t mode = 0777);
+	/// Attempt to open the file with the given mode, throw exception on failure.
 	int openat_or_die(int dir_fd, const string &file, int flags = O_RDONLY, mode_t mode = 0777);

-	// Decode open parameters
+	/// Decode open flags
 	string o_flags_ntoa(int flags);
+	/// Decode open mode
 	string o_mode_ntoa(mode_t mode);

-	// mmap with its one weird error case
+	/// mmap with its one weird error case
 	void *mmap_or_die(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
-	// Decode mmap parameters
+	/// Decode mmap prot
 	string mmap_prot_ntoa(int prot);
+	/// Decode mmap flags
 	string mmap_flags_ntoa(int flags);

-	// Unlink, rename
+	/// Rename, throw exception on failure.
 	void rename_or_die(const string &from, const string &to);
+	/// Rename, throw exception on failure.
 	void renameat_or_die(int fromfd, const string &frompath, int tofd, const string &topath);

+	/// Truncate, throw exception on failure.
 	void ftruncate_or_die(int fd, off_t size);

 	// Read or write structs:
@@ -87,19 +93,25 @@ namespace crucible {
 	// Three-arg version of read_or_die/write_or_die throws an error on incomplete read/writes
 	// Four-arg version returns number of bytes read/written through reference arg

+	/// Attempt read by pointer and length, throw exception on IO error or short read.
 	void read_or_die(int fd, void *buf, size_t size);
+	/// Attempt read of a POD struct, throw exception on IO error or short read.
 	template <class T> void read_or_die(int fd, T& buf)
 	{
 		return read_or_die(fd, static_cast<void *>(&buf), sizeof(buf));
 	}

+	/// Attempt read by pointer and length, throw exception on IO error but not short read.
 	void read_partial_or_die(int fd, void *buf, size_t size_wanted, size_t &size_read);
+	/// Attempt read of a POD struct, throw exception on IO error but not short read.
 	template <class T> void read_partial_or_die(int fd, T& buf, size_t &size_read)
 	{
 		return read_partial_or_die(fd, static_cast<void *>(&buf), sizeof(buf), size_read);
 	}

+	/// Attempt read at position by pointer and length, throw exception on IO error but not short read.
 	void pread_or_die(int fd, void *buf, size_t size, off_t offset);
+	/// Attempt read at position of a POD struct, throw exception on IO error but not short read.
 	template <class T> void pread_or_die(int fd, T& buf, off_t offset)
 	{
 		return pread_or_die(fd, static_cast<void *>(&buf), sizeof(buf), offset);
@@ -135,14 +147,14 @@ namespace crucible {
 	template<> void pread_or_die<vector<char>>(int fd, vector<char>& str, off_t offset) = delete;
 	template<> void pwrite_or_die<vector<char>>(int fd, const vector<char>& str, off_t offset) = delete;

-	// A different approach to reading a simple string
+	/// Read a simple string.
 	string read_string(int fd, size_t size);

-	// A lot of Unix API wants you to initialize a struct and call
-	// one function to fill it, another function to throw it away,
-	// and has some unknown third thing you have to do when there's
-	// an error.  That's also a C++ object with an exception-throwing
-	// constructor.
+	/// A lot of Unix API wants you to initialize a struct and call
+	/// one function to fill it, another function to throw it away,
+	/// and has some unknown third thing you have to do when there's
+	/// an error.  That's also a C++ object with an exception-throwing
+	/// constructor.
 	struct Stat : public stat {
 		Stat();
 		Stat(int f);
@@ -156,17 +168,17 @@ namespace crucible {

 	string st_mode_ntoa(mode_t mode);

-	// Because it's not trivial to do correctly
+	/// Because it's not trivial to do correctly
 	string readlink_or_die(const string &path);

-	// Determine the name of a FD by readlink through /proc/self/fd/
+	/// Determine the name of a FD by readlink through /proc/self/fd/
 	string name_fd(int fd);

-	// Returns Fd objects because it does own them.
+	/// Returns Fd objects because it does own them.
 	pair<Fd, Fd> socketpair_or_die(int domain = AF_UNIX, int type = SOCK_STREAM, int protocol = 0);

-	// like unique_lock but for flock instead of mutexes...and not trying
-	// to hide the many and subtle differences between those two things *at all*.
+	/// like unique_lock but for flock instead of mutexes...and not trying
+	/// to hide the many and subtle differences between those two things *at all*.
 	class Flock {
 		int	m_fd;
 		bool	m_locked;
@@ -187,7 +199,7 @@ namespace crucible {
 		int fd();
 	};

-	// Doesn't use Fd objects because it's usually just used to replace stdin/stdout/stderr.
+	/// Doesn't use Fd objects because it's usually just used to replace stdin/stdout/stderr.
 	void dup2_or_die(int fd_in, int fd_out);

 }
--- a/include/crucible/fs.h
+++ b/include/crucible/fs.h
@@ -27,20 +27,16 @@ namespace crucible {
 	// wrapper around fallocate(...FALLOC_FL_PUNCH_HOLE...)
 	void punch_hole(int fd, off_t offset, off_t len);

-	struct BtrfsExtentInfo : public btrfs_ioctl_same_extent_info {
-		BtrfsExtentInfo(int dst_fd, off_t dst_offset);
-	};
-
 	struct BtrfsExtentSame {
 		virtual ~BtrfsExtentSame();
 		BtrfsExtentSame(int src_fd, off_t src_offset, off_t src_length);
-		void add(int fd, off_t offset);
+		void add(int fd, uint64_t offset);
 		virtual void do_ioctl();

 		uint64_t m_logical_offset = 0;
 		uint64_t m_length = 0;
 		int m_fd;
-		vector<BtrfsExtentInfo> m_info;
+		vector<btrfs_ioctl_same_extent_info> m_info;
 	};

 	ostream & operator<<(ostream &os, const btrfs_ioctl_same_extent_info *info);
@@ -68,16 +64,17 @@ namespace crucible {
 		ByteVector m_data;
 	};

-	struct BtrfsIoctlLogicalInoArgs : public btrfs_ioctl_logical_ino_args {
+	struct BtrfsIoctlLogicalInoArgs {
 		BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 16 * 1024 * 1024);

 		uint64_t get_flags() const;
 		void set_flags(uint64_t new_flags);
+		void set_logical(uint64_t new_logical);
+		void set_size(uint64_t new_size);

-		virtual void do_ioctl(int fd);
-		virtual bool do_ioctl_nothrow(int fd);
+		void do_ioctl(int fd);
+		bool do_ioctl_nothrow(int fd);

-		size_t m_container_size;
 		struct BtrfsInodeOffsetRootSpan {
 			using iterator = BtrfsInodeOffsetRoot*;
 			using const_iterator = const BtrfsInodeOffsetRoot*;
@@ -88,13 +85,17 @@ namespace crucible {
 			const_iterator cend() const;
 			iterator data() const;
 			void clear();
-			operator vector<BtrfsInodeOffsetRoot>() const;
 		private:
 			iterator m_begin = nullptr;
 			iterator m_end = nullptr;
 		friend struct BtrfsIoctlLogicalInoArgs;
 		} m_iors;
+	private:
+		size_t m_container_size;
 		BtrfsDataContainer m_container;
+		uint64_t m_logical;
+		uint64_t m_flags = 0;
+	friend ostream & operator<<(ostream &os, const BtrfsIoctlLogicalInoArgs *p);
 	};

 	ostream & operator<<(ostream &os, const BtrfsIoctlLogicalInoArgs &p);
@@ -126,15 +127,6 @@ namespace crucible {

 	ostream & operator<<(ostream &os, const BtrfsIoctlDefragRangeArgs *p);

-	// in btrfs/ctree.h, but that's a nightmare to #include here
-	typedef enum {
-		BTRFS_COMPRESS_NONE  = 0,
-		BTRFS_COMPRESS_ZLIB  = 1,
-		BTRFS_COMPRESS_LZO   = 2,
-		BTRFS_COMPRESS_ZSTD  = 3,
-		BTRFS_COMPRESS_TYPES = 3
-	} btrfs_compression_type;
-
 	struct FiemapExtent : public fiemap_extent {
 		FiemapExtent();
 		FiemapExtent(const fiemap_extent &that);
@@ -212,6 +204,7 @@ namespace crucible {

 	string btrfs_search_type_ntoa(unsigned type);
 	string btrfs_search_objectid_ntoa(uint64_t objectid);
+	string btrfs_compress_type_ntoa(uint8_t type);

 	uint64_t btrfs_get_root_id(int fd);
 	uint64_t btrfs_get_root_transid(int fd);
--- a/include/crucible/hexdump.h
+++ b/include/crucible/hexdump.h
@@ -0,0 +1,36 @@
+#ifndef CRUCIBLE_HEXDUMP_H
+#define CRUCIBLE_HEXDUMP_H
+
+#include "crucible/string.h"
+
+#include <ostream>
+
+namespace crucible {
+	using namespace std;
+
+	template <class V>
+	ostream &
+	hexdump(ostream &os, const V &v)
+	{
+		os << "V { size = " << v.size() << ", data:\n";
+		for (size_t i = 0; i < v.size(); i += 8) {
+			string hex, ascii;
+			for (size_t j = i; j < i + 8; ++j) {
+				if (j < v.size()) {
+					uint8_t c = v[j];
+					char buf[8];
+					sprintf(buf, "%02x ", c);
+					hex += buf;
+					ascii += (c < 32 || c > 126) ? '.' : c;
+				} else {
+					hex += "   ";
+					ascii += ' ';
+				}
+			}
+			os << astringprintf("\t%08zx %s %s\n", i, hex.c_str(), ascii.c_str());
+		}
+		return os << "}";
+	}
+}
+
+#endif // CRUCIBLE_HEXDUMP_H
--- a/include/crucible/multilock.h
+++ b/include/crucible/multilock.h
@@ -0,0 +1,40 @@
+#ifndef CRUCIBLE_MULTILOCK_H
+#define CRUCIBLE_MULTILOCK_H
+
+#include <condition_variable>
+#include <map>
+#include <memory>
+#include <mutex>
+#include <string>
+
+namespace crucible {
+	using namespace std;
+
+	class MultiLocker {
+		mutex m_mutex;
+		condition_variable m_cv;
+		map<string, size_t> m_counters;
+
+		class LockHandle {
+			const string m_type;
+			MultiLocker &m_parent;
+			bool m_locked = false;
+			void set_locked(bool state);
+		public:
+			~LockHandle();
+			LockHandle(const string &type, MultiLocker &parent);
+		friend class MultiLocker;
+		};
+
+		friend class LockHandle;
+
+		bool is_lock_available(const string &type);
+		void put_lock(const string &type);
+		shared_ptr<LockHandle> get_lock_private(const string &type);
+	public:
+		static shared_ptr<LockHandle> get_lock(const string &type);
+	};
+
+}
+
+#endif // CRUCIBLE_MULTILOCK_H
--- a/include/crucible/namedptr.h
+++ b/include/crucible/namedptr.h
@@ -12,13 +12,18 @@
 namespace crucible {
 	using namespace std;

-	/// Storage for objects with unique names
+	/// A thread-safe container for RAII of shared resources with unique names.

 	template <class Return, class... Arguments>
 	class NamedPtr {
 	public:
+		/// The name in "NamedPtr"
 		using Key = tuple<Arguments...>;
+		/// A shared pointer to the named object with ownership
+		/// tracking that erases the object's stored name when
+		/// the last shared pointer is destroyed.
 		using Ptr = shared_ptr<Return>;
+		/// A function that translates a name into a shared pointer to an object.
 		using Func = function<Ptr(Arguments...)>;
 	private:
 		struct Value;
@@ -29,6 +34,7 @@ namespace crucible {
 			mutex		m_mutex;
 		};
 		using MapPtr = shared_ptr<MapRep>;
+		/// Container for Return pointers.  Destructor removes entry from map.
 		struct Value {
 			Ptr	m_ret_ptr;
 			MapPtr	m_map_rep;
@@ -50,15 +56,21 @@ namespace crucible {
 		void func(Func f);

 		Ptr operator()(Arguments... args);
+
 		Ptr insert(const Ptr &r, Arguments... args);
 	};

+	/// Construct NamedPtr map and define a function to turn a name into a pointer.
 	template <class Return, class... Arguments>
 	NamedPtr<Return, Arguments...>::NamedPtr(Func f) :
 		m_fn(f)
 	{
 	}

+	/// Construct a Value wrapper: the value to store, the argument key to store the value under,
+	/// and a pointer to the map.  Everything needed to remove the key from the map when the
+	/// last NamedPtr is deleted.  NamedPtr then releases its own pointer to the value, which
+	/// may or may not trigger deletion there.
 	template <class Return, class... Arguments>
 	NamedPtr<Return, Arguments...>::Value::Value(Ptr&& ret_ptr, const Key &key, const MapPtr &map_rep) :
 		m_ret_ptr(ret_ptr),
@@ -67,6 +79,8 @@ namespace crucible {
 	{
 	}

+	/// Destroy a Value wrapper: remove a dead Key from the map, then let the member destructors
+	/// do the rest.  The Key might be in the map and not dead, so leave it alone in that case.
 	template <class Return, class... Arguments>
 	NamedPtr<Return, Arguments...>::Value::~Value()
 	{
@@ -88,6 +102,8 @@ namespace crucible {
 		}
 	}

+	/// Find a Return by key and fetch a strong Return pointer.
+	/// Ignore Keys that have expired weak pointers.
 	template <class Return, class... Arguments>
 	typename NamedPtr<Return, Arguments...>::Ptr
 	NamedPtr<Return, Arguments...>::lookup_item(const Key &k)
@@ -109,6 +125,11 @@ namespace crucible {
 		return Ptr();
 	}

+	/// Insert the Return value of calling Func(Arguments...).
+	/// If the value already exists in the map, return the existing value.
+	/// If another thread is already running Func(Arguments...) then this thread
+	/// will block until the other thread finishes inserting the Return in the
+	/// map, and both threads will return the same Return value.
 	template <class Return, class... Arguments>
 	typename NamedPtr<Return, Arguments...>::Ptr
 	NamedPtr<Return, Arguments...>::insert_item(Func fn, Arguments... args)
@@ -169,6 +190,7 @@ namespace crucible {
 		// Release map lock, then key lock
 	}

+	/// (Re)define a function to turn a name into a pointer.
 	template <class Return, class... Arguments>
 	void
 	NamedPtr<Return, Arguments...>::func(Func func)
@@ -177,6 +199,7 @@ namespace crucible {
 		m_fn = func;
 	}

+	/// Convert a name into a pointer using the configured function.
 	template<class Return, class... Arguments>
 	typename NamedPtr<Return, Arguments...>::Ptr
 	NamedPtr<Return, Arguments...>::operator()(Arguments... args)
@@ -184,6 +207,11 @@ namespace crucible {
 		return insert_item(m_fn, args...);
 	}

+	/// Insert a pointer that has already been created under the
+	/// given name.  Useful for inserting a pointer to a derived
+	/// class when the name doesn't contain all of the information
+	/// required for the object, or when the Return is already known by
+	/// some cheaper method than calling the function.
 	template<class Return, class... Arguments>
 	typename NamedPtr<Return, Arguments...>::Ptr
 	NamedPtr<Return, Arguments...>::insert(const Ptr &r, Arguments... args)
@@ -194,4 +222,4 @@ namespace crucible {

 }

-#endif // NAMEDPTR_H
+#endif // CRUCIBLE_NAMEDPTR_H
--- a/include/crucible/ntoa.h
+++ b/include/crucible/ntoa.h
@@ -20,7 +20,7 @@ namespace crucible {
 #define NTOA_TABLE_ENTRY_BITS(x) { .n = (x), .mask = (x), .a = (#x) }

 // Enumerations (entire value matches all bits)
-#define NTOA_TABLE_ENTRY_ENUM(x) { .n = (x), .mask = ~0UL,  .a = (#x) }
+#define NTOA_TABLE_ENTRY_ENUM(x) { .n = (x), .mask = ~0ULL,  .a = (#x) }

 // End of table (sorry, C++ didn't get C99's compound literals, so we have to write out all the member names)
 #define NTOA_TABLE_ENTRY_END() { .n = 0, .mask = 0, .a = nullptr }
--- a/include/crucible/progress.h
+++ b/include/crucible/progress.h
@@ -4,13 +4,20 @@
 #include "crucible/error.h"

 #include <functional>
-#include <map>
 #include <memory>
 #include <mutex>
+#include <set>
+
+#include <cassert>

 namespace crucible {
 	using namespace std;

+	/// A class to track progress of multiple workers using only two points:
+	/// the first and last incomplete state.  The first incomplete
+	/// state can be recorded as a checkpoint to resume later on.
+	/// The last incomplete state is the starting point for workers that
+	/// need something to do.
 	template <class T>
 	class ProgressTracker {
 		struct ProgressTrackerState;
@@ -19,8 +26,16 @@ namespace crucible {
 		using value_type = T;
 		using ProgressHolder = shared_ptr<ProgressHolderState>;

+		/// Create ProgressTracker with initial begin and end state 'v'.
 		ProgressTracker(const value_type &v);
+
+		/// The first incomplete state.  This is not "sticky":
+		/// it reverts to the end state when there are no
+		/// items in progress.
 		value_type begin() const;
+
+		/// The last incomplete state.  This is "sticky",
+		/// it can only increase and never decrease.
 		value_type end() const;

 		ProgressHolder hold(const value_type &v);
@@ -31,7 +46,7 @@ namespace crucible {
 		struct ProgressTrackerState {
 			using key_type = pair<value_type, ProgressHolderState *>;
 			mutex			m_mutex;
-			map<key_type, bool>	m_in_progress;
+			set<key_type>		m_in_progress;
 			value_type		m_begin;
 			value_type		m_end;
 		};
@@ -39,6 +54,7 @@ namespace crucible {
 		class ProgressHolderState {
 			shared_ptr<ProgressTrackerState>	m_state;
 			const value_type			m_value;
+			using key_type = typename ProgressTrackerState::key_type;
 		public:
 			ProgressHolderState(shared_ptr<ProgressTrackerState> state, const value_type &v);
 			~ProgressHolderState();
@@ -86,7 +102,11 @@ namespace crucible {
 		m_value(v)
 	{
 		unique_lock<mutex> lock(m_state->m_mutex);
-		m_state->m_in_progress[make_pair(m_value, this)] = true;
+		const auto rv = m_state->m_in_progress.insert(key_type(m_value, this));
+		THROW_CHECK1(runtime_error, m_value, rv.second);
+		// Set the beginning to the first existing in-progress item
+		m_state->m_begin = m_state->m_in_progress.begin()->first;
+		// If this value is past the end, move the end, but don't go backwards
 		if (m_state->m_end < m_value) {
 			m_state->m_end = m_value;
 		}
@@ -96,17 +116,15 @@ namespace crucible {
 	ProgressTracker<T>::ProgressHolderState::~ProgressHolderState()
 	{
 		unique_lock<mutex> lock(m_state->m_mutex);
-		m_state->m_in_progress[make_pair(m_value, this)] = false;
-		auto p = m_state->m_in_progress.begin();
-		while (p != m_state->m_in_progress.end()) {
-			if (p->second) {
-				break;
-			}
-			if (m_state->m_begin < p->first.first) {
-				m_state->m_begin = p->first.first;
-			}
-			m_state->m_in_progress.erase(p);
-			p = m_state->m_in_progress.begin();
+		const auto rv = m_state->m_in_progress.erase(key_type(m_value, this));
+		// THROW_CHECK2(runtime_error, m_value, rv, rv == 1);
+		assert(rv == 1);
+		if (m_state->m_in_progress.empty()) {
+			// If we made the list empty, then m_begin == m_end
+			m_state->m_begin = m_state->m_end;
+		} else {
+			// If we deleted the first element, then m_begin = current first element
+			m_state->m_begin = m_state->m_in_progress.begin()->first;
 		}
 	}

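The begin/end bookkeeping in the progress.h hunks above can be sketched in isolation. This is a minimal, hypothetical stand-in, not the crucible class: the names MiniTracker and Holder are invented for illustration, the value type is fixed to a 64-bit integer, and the THROW_CHECK/assert checks are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <set>
#include <utility>

// Hypothetical, simplified stand-in for ProgressTracker<uint64_t>.
// MiniTracker and Holder are illustrative names, not part of crucible.
class MiniTracker {
	struct State {
		std::mutex m_mutex;
		// Key is (value, holder identity) so two holders may share one value.
		std::set<std::pair<std::uint64_t, std::uintptr_t>> m_in_progress;
		std::uint64_t m_begin = 0;
		std::uint64_t m_end = 0;
	};
	std::shared_ptr<State> m_state;

public:
	class Holder {
		std::shared_ptr<State> m_state;
		const std::uint64_t m_value;
		std::pair<std::uint64_t, std::uintptr_t> key() const
		{
			return std::make_pair(m_value, reinterpret_cast<std::uintptr_t>(this));
		}
	public:
		Holder(std::shared_ptr<State> state, std::uint64_t v) :
			m_state(std::move(state)), m_value(v)
		{
			std::unique_lock<std::mutex> lock(m_state->m_mutex);
			m_state->m_in_progress.insert(key());
			// begin is always the lowest value still in progress
			m_state->m_begin = m_state->m_in_progress.begin()->first;
			// end only moves forward, never backward
			if (m_state->m_end < m_value) {
				m_state->m_end = m_value;
			}
		}
		~Holder()
		{
			std::unique_lock<std::mutex> lock(m_state->m_mutex);
			m_state->m_in_progress.erase(key());
			if (m_state->m_in_progress.empty()) {
				// nothing in progress: begin catches up with end
				m_state->m_begin = m_state->m_end;
			} else {
				m_state->m_begin = m_state->m_in_progress.begin()->first;
			}
		}
		Holder(const Holder &) = delete;
		Holder &operator=(const Holder &) = delete;
	};

	explicit MiniTracker(std::uint64_t v) : m_state(std::make_shared<State>())
	{
		m_state->m_begin = v;
		m_state->m_end = v;
	}

	std::shared_ptr<Holder> hold(std::uint64_t v)
	{
		return std::make_shared<Holder>(m_state, v);
	}

	std::uint64_t begin() const
	{
		std::unique_lock<std::mutex> lock(m_state->m_mutex);
		return m_state->m_begin;
	}

	std::uint64_t end() const
	{
		std::unique_lock<std::mutex> lock(m_state->m_mutex);
		return m_state->m_end;
	}
};
```

In this sketch, holding 20 and then 30 leaves begin at 20 and end at 30. Dropping the holder for 20 moves begin to 30, and dropping the last holder leaves begin equal to end.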
--- a/include/crucible/seeker.h
+++ b/include/crucible/seeker.h
@@ -0,0 +1,163 @@
+#ifndef _CRUCIBLE_SEEKER_H_
+#define _CRUCIBLE_SEEKER_H_
+
+#include "crucible/error.h"
+
+#include <algorithm>
+#include <limits>
+
+#include <cstdint>
+
+#if 1
+#include <iostream>
+#include <sstream>
+#define DINIT(__x) __x
+#define DLOG(__x) do { logs << __x << std::endl; } while (false)
+#define DOUT(__err) do { __err << logs.str(); } while (false)
+#else
+#define DINIT(__x) do {} while (false)
+#define DLOG(__x) do {} while (false)
+#define DOUT(__x) do {} while (false)
+#endif
+
+namespace crucible {
+	using namespace std;
+
+	// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
+	// - fetches objects in Pos order, starting from lower (must be >= lower)
+	// - must return upper if present, may or may not return objects after that
+	// - returns a container of Pos objects with begin(), end(), rbegin(), rend()
+	// - container must iterate over objects in Pos order
+	// - uniqueness of Pos objects not required
+	// - should store the underlying data as a side effect
+	//
+	// Requirements for Pos:
+	// - should behave like an unsigned integer type
+	// - must have specializations in numeric_limits<T> for digits, max(), min()
+	// - must support +, -, -=, and related operators
+	// - must support <, <=, ==, and related operators
+	// - must support Pos / 2 (only)
+	//
+	// Requirements for seek_backward:
+	// - calls Fetch to search Pos space near target_pos
+	// - if no key exists with value <= target_pos, returns the minimum Pos value
+	// - returns the highest key value <= target_pos
+	// - returned key value may not be part of most recent Fetch result
+	// - 1 loop iteration when target_pos exists
+
+	template <class Fetch, class Pos = uint64_t>
+	Pos
+	seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
+	{
+		DINIT(ostringstream logs);
+		try {
+			static const Pos end_pos = numeric_limits<Pos>::max();
+			// TBH this probably won't work if begin_pos != 0, i.e. any signed type
+			static const Pos begin_pos = numeric_limits<Pos>::min();
+			// Run a binary search looking for the highest key at or below target_pos.
+			// Initial upper bound of the search is target_pos.
+			// Find initial lower bound by doubling the size of the range until a key below target_pos
+			// is found, or the lower bound reaches the beginning of the search space.
+			// If the lower bound search reaches the beginning of the search space without finding a key,
+			// return the beginning of the search space; otherwise, perform a binary search between
+			// the bounds now established.
+			Pos lower_bound = 0;
+			Pos upper_bound = target_pos;
+			bool found_low = false;
+			Pos probe_pos = target_pos;
+			// We need one loop for each bit of the search space to find the lower bound,
+			// one loop for each bit of the search space to narrow the range by binary search,
+			// and one extra loop to confirm the boundary is correct.
+			for (size_t loop_count = min(numeric_limits<Pos>::digits * size_t(2) + 1, max_loops); loop_count; --loop_count) {
+				DLOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
+				auto result = fetch(probe_pos, target_pos);
+				const Pos low_pos = result.empty() ? end_pos : *result.begin();
+				const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
+				DLOG(" = " << low_pos << ".." << high_pos);
+				// check for correct behavior of the fetch function
+				THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
+				THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
+				THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
+				if (!found_low) {
+					// if target_pos == end_pos then we will find it in every empty result set,
+					// so in that case we force the lower bound to be lower than end_pos
+					if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
+						// found a lower bound, set the low bound there and switch to binary search
+						found_low = true;
+						lower_bound = low_pos;
+						DLOG("found_low = true, lower_bound = " << lower_bound);
+					} else {
+						// still looking for lower bound
+						// if probe_pos was begin_pos then we can stop with no result
+						if (probe_pos == begin_pos) {
+							DLOG("return: probe_pos == begin_pos " << begin_pos);
+							return begin_pos;
+						}
+						// double the range size, or use the distance between objects found so far
+						THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+						// already checked low_pos <= high_pos above
+						const Pos want_delta = max(upper_bound - probe_pos, min_step);
+						// avoid underflowing the beginning of the search space
+						const Pos have_delta = min(want_delta, probe_pos - begin_pos);
+						THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
+						// move probe and try again
+						probe_pos = probe_pos - have_delta;
+						DLOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
+						continue;
+					}
+				}
+				if (low_pos <= target_pos && target_pos <= high_pos) {
+					// have keys on either side of target_pos in result
+					// search from the high end until we find the highest key at or below target
+					for (auto i = result.rbegin(); i != result.rend(); ++i) {
+						// more correctness checking for fetch
+						THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
+						if (*i <= target_pos) {
+							DLOG("return: *i " << *i << " <= target_pos " << target_pos);
+							return *i;
+						}
+					}
+					// if the list is empty then low_pos = high_pos = end_pos
+					// if target_pos = end_pos also, then we will execute the loop
+					// above but not find any matching entries.
+					THROW_CHECK0(runtime_error, result.empty());
+				}
+				if (target_pos <= low_pos) {
+					// results are all too high, so probe_pos..low_pos is too high
+					// lower the high bound to the probe pos
+					upper_bound = probe_pos;
+					DLOG("upper_bound = probe_pos " << probe_pos);
+				}
+				if (high_pos < target_pos) {
+					// results are all too low, so probe_pos..high_pos is too low
+					// raise the low bound to the high_pos
+					DLOG("lower_bound = high_pos " << high_pos);
+					lower_bound = high_pos;
+				}
+				// compute a new probe pos at the middle of the range and try again
+				// we can't have a zero-size range here because we would not have set found_low yet
+				THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
+				const Pos delta = (upper_bound - lower_bound) / 2;
+				probe_pos = lower_bound + delta;
+				if (delta < 1) {
+					// nothing can exist in the range (lower_bound, upper_bound)
+					// and an object is known to exist at lower_bound
+					DLOG("return: probe_pos == lower_bound " << lower_bound);
+					return lower_bound;
+				}
+				THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
+				THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
+				DLOG("loop: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
+			}
+			THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
+				"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
+				"found_low " << found_low);
+		} catch (...) {
+			DOUT(cerr);
+			throw;
+		}
+	}
+}
+
+#endif // _CRUCIBLE_SEEKER_H_
+
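The Fetch requirements listed in the seeker.h comments can be illustrated with an in-memory callable. This is a sketch under stated assumptions, not code from the commit: SetFetch and highest_key_at_or_below are hypothetical names, the backing store is a std::set, and the second function only computes the value that the comments say seek_backward returns. It does not reproduce the probing search itself.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <limits>
#include <set>
#include <vector>

// Hypothetical Fetch callable backed by a sorted in-memory set.
// Returns keys in [lower, upper] in ascending order, so it includes
// upper when present and never returns anything below lower.
struct SetFetch {
	const std::set<std::uint64_t> &keys;

	std::vector<std::uint64_t> operator()(std::uint64_t lower, std::uint64_t upper) const
	{
		if (upper < lower) {
			// Guard against an inverted range, which would be an invalid iterator pair
			return {};
		}
		return std::vector<std::uint64_t>(keys.lower_bound(lower), keys.upper_bound(upper));
	}
};

// Reference answer for the documented contract: the highest key <= target,
// or the minimum Pos value when no such key exists.
std::uint64_t highest_key_at_or_below(const std::set<std::uint64_t> &keys, std::uint64_t target)
{
	const auto it = keys.upper_bound(target);
	if (it == keys.begin()) {
		return std::numeric_limits<std::uint64_t>::min();
	}
	return *std::prev(it);
}
```

A std::vector meets the container requirements (begin(), end(), rbegin(), rend(), empty()). A callable like this could be used to compare seek_backward against the reference answer in a unit test.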
--- a/include/crucible/task.h
+++ b/include/crucible/task.h
@@ -3,6 +3,7 @@

 #include <functional>
 #include <memory>
+#include <mutex>
 #include <ostream>
 #include <string>

@@ -92,92 +93,92 @@ namespace crucible {
 		/// Gets the current number of active workers
 		static size_t get_thread_count();

+		/// Gets the current load tracking statistics
+		struct LoadStats {
+			/// Current load extracted from last two 5-second load average samples
+			double current_load;
+			/// Target thread count computed from previous thread count and current load
+			double thread_target;
+			/// Load average for last 60 seconds
+			double loadavg;
+		};
+		static LoadStats get_current_load();
+
 		/// Drop the current queue and discard new Tasks without
 		/// running them.  Currently executing tasks are not
 		/// affected (use set_thread_count(0) to wait for those
 		/// to complete).
 		static void cancel();
-	};

-	// Barrier executes waiting Tasks once the last BarrierLock
-	// is released.  Multiple unique Tasks may be scheduled while
-	// BarrierLocks exist and all will be run() at once upon
-	// release.  If no BarrierLocks exist, Tasks are executed
-	// immediately upon insertion.
+		/// Stop running any new Tasks.  All existing
+		/// Consumer threads will exit.  Does not affect queue.
+		/// Does not wait for threads to exit.  Reversible.
+		static void pause(bool paused = true);
+	};

 	class BarrierState;

-	class BarrierLock {
-		shared_ptr<BarrierState> m_barrier_state;
-		BarrierLock(shared_ptr<BarrierState> pbs);
-	friend class Barrier;
-	public:
-		// Release this Lock immediately and permanently
-		void release();
-	};
-
+	/// Barrier delays the execution of one or more Tasks.
+	/// The Tasks are executed when the last shared reference to the
+	/// BarrierState is released.  Copies of Barrier objects refer
+	/// to the same Barrier state.
 	class Barrier {
 		shared_ptr<BarrierState> m_barrier_state;

-		Barrier(shared_ptr<BarrierState> pbs);
 	public:
 		Barrier();

-		// Prevent execution of tasks behind barrier until
-		// BarrierLock destructor or release() method is called.
-		BarrierLock lock();
-
-		// Schedule a task for execution when no Locks exist
+		/// Schedule a task for execution when last Barrier is released.
 		void insert_task(Task t);
+
+		/// Release this reference to the barrier state.
+		/// Last released reference executes the task.
+		/// Barrier can only be released once, after which the
+		/// object can no longer be used.
+		void release();
 	};

-	// Exclusion provides exclusive access to a ExclusionLock.
-	// One Task will be able to obtain the ExclusionLock; other Tasks
-	// may schedule themselves for re-execution after the ExclusionLock
-	// is released.
-
-	class ExclusionState;
-	class Exclusion;
-
 	class ExclusionLock {
-		shared_ptr<ExclusionState> m_exclusion_state;
-		ExclusionLock(shared_ptr<ExclusionState> pes);
-		ExclusionLock() = default;
+		shared_ptr<Task> m_owner;
+		ExclusionLock(shared_ptr<Task> owner);
 	friend class Exclusion;
 	public:
-		// Calls release()
-		~ExclusionLock();
+		/// Explicit default constructor because other constructors are declared
+		ExclusionLock() = default;

-		// Release this Lock immediately and permanently
+		/// Release this Lock immediately and permanently
 		void release();

-		// Test for locked state
+		/// Test for locked state
 		operator bool() const;
 	};

 	class Exclusion {
-		shared_ptr<ExclusionState> m_exclusion_state;
+		mutex m_mutex;
+		weak_ptr<Task> m_owner;

-		Exclusion(shared_ptr<ExclusionState> pes);
 	public:
-		Exclusion(const string &title);
+		/// Attempt to obtain a Lock.  If successful, current Task
+		/// owns the Lock until the ExclusionLock is released
+		/// (it is the ExclusionLock that owns the lock, so it can
+		/// be passed to other Tasks or threads, but this is not
+		/// recommended practice).
+		/// If not successful, current Task is appended to the
+		/// task that currently holds the lock.  Current task is
+		/// expected to release any other ExclusionLock
+		/// objects it holds, and exit its Task function.
+		ExclusionLock try_lock(const Task &task);

-		// Attempt to obtain a Lock.  If successful, current Task
-		// owns the Lock until the ExclusionLock is released
-		// (it is the ExclusionLock that owns the lock, so it can
-		// be passed to other Tasks or threads, but this is not
-		// recommended practice).
-		// If not successful, current Task is expected to call
-		// insert_task(current_task()), release any ExclusionLock
-		// objects it holds, and exit its Task function.
-		ExclusionLock try_lock();
-
-		// Execute Task when Exclusion is unlocked (possibly
-		// immediately).
-		void insert_task(Task t = Task::current_task());
+		/// Execute Task when Exclusion is unlocked (possibly
+		/// immediately).
+		void insert_task(const Task &t);
 	};

+	/// Wrapper around pthread_setname_np which handles length limits
+	void pthread_setname(const string &name);

+	/// Wrapper around pthread_getname_np for symmetry
+	string pthread_getname();
 }

 #endif // CRUCIBLE_TASK_H
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -1,10 +1,9 @@
-TAG ?= $(shell git describe --always --dirty || echo UNKNOWN)
-
 default: libcrucible.a
 %.a: Makefile

 CRUCIBLE_OBJS = \
 	bytevector.o \
+	btrfs-tree.o \
 	chatter.o \
 	city.o \
 	cleanup.o \
@@ -13,6 +12,7 @@ CRUCIBLE_OBJS = \
 	extentwalker.o \
 	fd.o \
 	fs.o \
+	multilock.o \
 	ntoa.o \
 	path.o \
 	process.o \
@@ -30,24 +30,13 @@ BEES_LDFLAGS = $(LDFLAGS)
 configure.h: configure.h.in
 	$(TEMPLATE_COMPILER)

-.depends:
-	mkdir -p $@
-
-.depends/%.dep: %.cc configure.h Makefile | .depends
+%.dep: %.cc configure.h Makefile
 	$(CXX) $(BEES_CXXFLAGS) -M -MF $@ -MT $(<:.cc=.o) $<

-depends.mk: $(CRUCIBLE_OBJS:%.o=.depends/%.dep)
-	cat $^ > $@.new
-	mv -f $@.new $@
-
-.version.cc: configure.h Makefile ../makeflags $(CRUCIBLE_OBJS:.o=.cc) ../include/crucible/*.h
-	echo "namespace crucible { const char *VERSION = \"$(TAG)\"; }" > $@.new
-	if ! cmp "$@.new" "$@"; then mv -fv $@.new $@; fi
-
-include depends.mk
+include $(CRUCIBLE_OBJS:%.o=%.dep)

 %.o: %.cc ../makeflags
 	$(CXX) $(BEES_CXXFLAGS) -o $@ -c $<

-libcrucible.a: $(CRUCIBLE_OBJS) .version.o
+libcrucible.a: $(CRUCIBLE_OBJS)
 	$(AR) rcs $@ $^
--- a/lib/btrfs-tree.cc
+++ b/lib/btrfs-tree.cc
@@ -0,0 +1,684 @@
+#include "crucible/btrfs-tree.h"
+#include "crucible/btrfs.h"
+#include "crucible/error.h"
+#include "crucible/fs.h"
+#include "crucible/hexdump.h"
+#include "crucible/seeker.h"
+
+namespace crucible {
+	using namespace std;
+
+	uint64_t
+	BtrfsTreeItem::extent_begin() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
+		return m_objectid;
+	}
+
+	uint64_t
+	BtrfsTreeItem::extent_end() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
+		return m_objectid + m_offset;
+	}
+
+	uint64_t
+	BtrfsTreeItem::extent_generation() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
+		return btrfs_get_member(&btrfs_extent_item::generation, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::root_ref_dirid() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_BACKREF_KEY);
+		return btrfs_get_member(&btrfs_root_ref::dirid, m_data);
+	}
+
+	string
+	BtrfsTreeItem::root_ref_name() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_BACKREF_KEY);
+		const auto name_len = btrfs_get_member(&btrfs_root_ref::name_len, m_data);
+		const auto name_start = sizeof(struct btrfs_root_ref);
+		const auto name_end = name_len + name_start;
+		THROW_CHECK2(runtime_error, m_data.size(), name_end, m_data.size() >= name_end);
+		return string(m_data.data() + name_start, m_data.data() + name_end);
+	}
+
+	uint64_t
+	BtrfsTreeItem::root_ref_parent_rootid() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_BACKREF_KEY);
+		return offset();
+	}
+
+	uint64_t
+	BtrfsTreeItem::root_flags() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_ITEM_KEY);
+		return btrfs_get_member(&btrfs_root_item::flags, m_data);
+	}
+
+	ostream &
+	operator<<(ostream &os, const BtrfsTreeItem &bti)
+	{
+		os << "BtrfsTreeItem {"
+			<< " objectid = " << to_hex(bti.objectid())
+			<< ", type = " << btrfs_search_type_ntoa(bti.type())
+			<< ", offset = " << to_hex(bti.offset())
+			<< ", transid = " << bti.transid()
+			<< ", data = ";
+		hexdump(os, bti.data());
+		return os;
+	}
+
+	uint64_t
+	BtrfsTreeItem::block_group_flags() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_BLOCK_GROUP_ITEM_KEY);
+		return btrfs_get_member(&btrfs_block_group_item::flags, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::block_group_used() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_BLOCK_GROUP_ITEM_KEY);
+		return btrfs_get_member(&btrfs_block_group_item::used, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::chunk_length() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_CHUNK_ITEM_KEY);
+		return btrfs_get_member(&btrfs_chunk::length, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::chunk_type() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_CHUNK_ITEM_KEY);
+		return btrfs_get_member(&btrfs_chunk::type, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::dev_extent_chunk_offset() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_EXTENT_KEY);
+		return btrfs_get_member(&btrfs_dev_extent::chunk_offset, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::dev_extent_length() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_EXTENT_KEY);
+		return btrfs_get_member(&btrfs_dev_extent::length, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::dev_item_total_bytes() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_ITEM_KEY);
+		return btrfs_get_member(&btrfs_dev_item::total_bytes, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::dev_item_bytes_used() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_ITEM_KEY);
+		return btrfs_get_member(&btrfs_dev_item::bytes_used, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::inode_size() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_INODE_ITEM_KEY);
+		return btrfs_get_member(&btrfs_inode_item::size, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::file_extent_logical_bytes() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
+		const auto file_extent_item_type = btrfs_get_member(&btrfs_file_extent_item::type, m_data);
+		switch (file_extent_item_type) {
+			case BTRFS_FILE_EXTENT_INLINE:
+				return btrfs_get_member(&btrfs_file_extent_item::ram_bytes, m_data);
+			case BTRFS_FILE_EXTENT_PREALLOC:
+			case BTRFS_FILE_EXTENT_REG:
+				return btrfs_get_member(&btrfs_file_extent_item::num_bytes, m_data);
+			default:
+				THROW_ERROR(runtime_error, "unknown btrfs_file_extent_item type " << file_extent_item_type);
+		}
+	}
+
+	uint64_t
+	BtrfsTreeItem::file_extent_offset() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
+		const auto file_extent_item_type = btrfs_get_member(&btrfs_file_extent_item::type, m_data);
+		switch (file_extent_item_type) {
+			case BTRFS_FILE_EXTENT_INLINE:
+				THROW_ERROR(invalid_argument, "extent is inline " << *this);
+			case BTRFS_FILE_EXTENT_PREALLOC:
+			case BTRFS_FILE_EXTENT_REG:
+				return btrfs_get_member(&btrfs_file_extent_item::offset, m_data);
+			default:
+				THROW_ERROR(runtime_error, "unknown btrfs_file_extent_item type " << file_extent_item_type << " in " << *this);
+		}
+	}
+
+	uint64_t
+	BtrfsTreeItem::file_extent_generation() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
+		return btrfs_get_member(&btrfs_file_extent_item::generation, m_data);
+	}
+
+	uint64_t
+	BtrfsTreeItem::file_extent_bytenr() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
+		auto file_extent_item_type = btrfs_get_member(&btrfs_file_extent_item::type, m_data);
+		switch (file_extent_item_type) {
+			case BTRFS_FILE_EXTENT_INLINE:
+				THROW_ERROR(invalid_argument, "extent is inline " << *this);
+			case BTRFS_FILE_EXTENT_PREALLOC:
+			case BTRFS_FILE_EXTENT_REG:
+				return btrfs_get_member(&btrfs_file_extent_item::disk_bytenr, m_data);
+			default:
+				THROW_ERROR(runtime_error, "unknown btrfs_file_extent_item type " << file_extent_item_type << " in " << *this);
+		}
+	}
+
+	uint8_t
+	BtrfsTreeItem::file_extent_type() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
+		return btrfs_get_member(&btrfs_file_extent_item::type, m_data);
+	}
+
+	btrfs_compression_type
+	BtrfsTreeItem::file_extent_compression() const
+	{
+		THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
+		return static_cast<btrfs_compression_type>(btrfs_get_member(&btrfs_file_extent_item::compression, m_data));
+	}
+
+	BtrfsTreeItem::BtrfsTreeItem(const BtrfsIoctlSearchHeader &bish) :
+		m_objectid(bish.objectid),
+		m_offset(bish.offset),
+		m_transid(bish.transid),
+		m_data(bish.m_data),
+		m_type(bish.type)
+	{
+	}
+
+	BtrfsTreeItem &
+	BtrfsTreeItem::operator=(const BtrfsIoctlSearchHeader &bish)
+	{
+		m_objectid = bish.objectid;
+		m_offset = bish.offset;
+		m_transid = bish.transid;
+		m_data = bish.m_data;
+		m_type = bish.type;
+		return *this;
+	}
+
+	bool
+	BtrfsTreeItem::operator!() const
+	{
+		return m_transid == 0 && m_objectid == 0 && m_offset == 0 && m_type == 0;
+	}
+
+	uint64_t
+	BtrfsTreeFetcher::block_size() const
+	{
+		return m_block_size;
+	}
+
+	BtrfsTreeFetcher::BtrfsTreeFetcher(Fd new_fd) :
+		m_fd(new_fd)
+	{
+		BtrfsIoctlFsInfoArgs bifia;
+		bifia.do_ioctl(fd());
+		m_block_size = bifia.sectorsize;
+		THROW_CHECK1(runtime_error, m_block_size, m_block_size > 0);
+		// We don't believe sector sizes that aren't multiples of 4K
+		THROW_CHECK1(runtime_error, m_block_size, (m_block_size % 4096) == 0);
+		m_lookbehind_size = 128 * 1024;
+		m_scale_size = m_block_size;
+	}
+
+	Fd
+	BtrfsTreeFetcher::fd() const
+	{
+		return m_fd;
+	}
+
+	void
+	BtrfsTreeFetcher::fd(Fd fd)
+	{
+		m_fd = fd;
+	}
+
+	void
+	BtrfsTreeFetcher::type(uint8_t type)
+	{
+		m_type = type;
+	}
+
+	void
+	BtrfsTreeFetcher::tree(uint64_t tree)
+	{
+		m_tree = tree;
+	}
+
+	void
+	BtrfsTreeFetcher::transid(uint64_t min_transid, uint64_t max_transid)
+	{
+		m_min_transid = min_transid;
+		m_max_transid = max_transid;
+	}
+
+	uint64_t
+	BtrfsTreeFetcher::lookbehind_size() const
+	{
+		return m_lookbehind_size;
+	}
+
+	void
+	BtrfsTreeFetcher::lookbehind_size(uint64_t lookbehind_size)
+	{
+		m_lookbehind_size = lookbehind_size;
+	}
+
+	uint64_t
+	BtrfsTreeFetcher::scale_size() const
+	{
+		return m_scale_size;
+	}
+
+	void
+	BtrfsTreeFetcher::scale_size(uint64_t scale_size)
+	{
+		m_scale_size = scale_size;
+	}
+
+	void
+	BtrfsTreeFetcher::fill_sk(BtrfsIoctlSearchKey &sk, uint64_t object)
+	{
+		(void)object;
+		// btrfs allows tree ID 0 meaning the current tree, but we do not.
+		THROW_CHECK0(invalid_argument, m_tree != 0);
+		sk.tree_id = m_tree;
+		sk.min_type = m_type;
+		sk.max_type = m_type;
+		sk.min_transid = m_min_transid;
+		sk.max_transid = m_max_transid;
+		sk.nr_items = 1;
+	}
+
+	void
+	BtrfsTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
+	{
+		key.next_min(hdr, m_type);
+	}
+
+	BtrfsTreeItem
+	BtrfsTreeFetcher::at(uint64_t logical)
+	{
+		BtrfsIoctlSearchKey &sk = m_sk;
+		fill_sk(sk, logical);
+		// Exact match, should return 0 or 1 items
+		sk.max_type = sk.min_type;
+		sk.nr_items = 1;
+		sk.do_ioctl(fd());
+		THROW_CHECK1(runtime_error, sk.m_result.size(), sk.m_result.size() < 2);
+		for (const auto &i : sk.m_result) {
+			if (hdr_logical(i) == logical && hdr_match(i)) {
+				return i;
+			}
+		}
+		return BtrfsTreeItem();
+	}
+
+	uint64_t
+	BtrfsTreeFetcher::scale_logical(const uint64_t logical) const
+	{
+		THROW_CHECK1(invalid_argument, logical, (logical % m_scale_size) == 0 || logical == s_max_logical);
+		return logical / m_scale_size;
+	}
+
+	uint64_t
+	BtrfsTreeFetcher::scaled_max_logical() const
+	{
+		return scale_logical(s_max_logical);
+	}
+
+	uint64_t
+	BtrfsTreeFetcher::unscale_logical(const uint64_t logical) const
+	{
+		THROW_CHECK1(invalid_argument, logical, logical <= scaled_max_logical());
+		if (logical == scaled_max_logical()) {
+			return s_max_logical;
+		}
+		return logical * scale_size();
+	}
+
+	BtrfsTreeItem
+	BtrfsTreeFetcher::rlower_bound(uint64_t logical)
+	{
+	#if 0
+	#define BTFRLB_DEBUG(x) do { cerr << x; } while (false)
+	#else
+	#define BTFRLB_DEBUG(x) do { } while (false)
+	#endif
+		BtrfsTreeItem closest_item;
+		uint64_t closest_logical = 0;
+		BtrfsIoctlSearchKey &sk = m_sk;
+		size_t loops = 0;
+		BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << endl);
+		seek_backward(scale_logical(logical), [&](uint64_t lower_bound, uint64_t upper_bound) {
+			++loops;
+			fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
+			set<uint64_t> rv;
+			do {
+				sk.nr_items = 4;
+				sk.do_ioctl(fd());
+				BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
+				for (auto &i : sk.m_result) {
+					next_sk(sk, i);
+					const auto this_logical = hdr_logical(i);
+					const auto scaled_hdr_logical = scale_logical(this_logical);
+					BTFRLB_DEBUG(" " << to_hex(scaled_hdr_logical));
+					if (hdr_match(i)) {
+						if (this_logical <= logical && this_logical > closest_logical) {
+							closest_logical = this_logical;
+							closest_item = i;
+						}
+						BTFRLB_DEBUG("(match)");
+						rv.insert(scaled_hdr_logical);
+					}
+					if (scaled_hdr_logical > upper_bound || hdr_stop(i)) {
+						if (scaled_hdr_logical >= upper_bound) {
+							BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
+						}
+						if (hdr_stop(i)) {
+							rv.insert(numeric_limits<uint64_t>::max());
+							BTFRLB_DEBUG("(stop)");
+						}
+						break;
+					} else {
+						BTFRLB_DEBUG("(cont'd)");
+					}
+				}
+				BTFRLB_DEBUG(endl);
+				// We might get a search result that contains only non-matching items.
+				// Keep looping until we find any matching item or we run out of tree.
+			} while (rv.empty() && !sk.m_result.empty());
+			return rv;
+		}, scale_logical(lookbehind_size()));
+		return closest_item;
+	#undef BTFRLB_DEBUG
+	}
+
+	BtrfsTreeItem
+	BtrfsTreeFetcher::lower_bound(uint64_t logical)
+	{
+		BtrfsIoctlSearchKey &sk = m_sk;
+		fill_sk(sk, logical);
+		do {
+			assert(sk.max_offset == s_max_logical);
+			sk.do_ioctl(fd());
+			for (const auto &i : sk.m_result) {
+				if (hdr_match(i)) {
+					return i;
+				}
+				if (hdr_stop(i)) {
+					return BtrfsTreeItem();
+				}
+				next_sk(sk, i);
+			}
+		} while (!sk.m_result.empty());
+		return BtrfsTreeItem();
+	}
+
+	BtrfsTreeItem
+	BtrfsTreeFetcher::next(uint64_t logical)
+	{
+		const auto scaled_logical = scale_logical(logical);
+		if (scaled_logical + 1 > scaled_max_logical()) {
+			return BtrfsTreeItem();
+		}
+		return lower_bound(unscale_logical(scaled_logical + 1));
+	}
+
+	BtrfsTreeItem
+	BtrfsTreeFetcher::prev(uint64_t logical)
+	{
+		const auto scaled_logical = scale_logical(logical);
+		if (scaled_logical < 1) {
+			return BtrfsTreeItem();
+		}
+		return rlower_bound(unscale_logical(scaled_logical - 1));
+	}
+
+	void
+	BtrfsTreeObjectFetcher::fill_sk(BtrfsIoctlSearchKey &sk, uint64_t object)
+	{
+		BtrfsTreeFetcher::fill_sk(sk, object);
+		sk.min_offset = 0;
+		sk.max_offset = numeric_limits<decltype(sk.max_offset)>::max();
+		sk.min_objectid = object;
+		sk.max_objectid = numeric_limits<decltype(sk.max_objectid)>::max();
+	}
+
+	uint64_t
+	BtrfsTreeObjectFetcher::hdr_logical(const BtrfsIoctlSearchHeader &hdr)
+	{
+		return hdr.objectid;
+	}
+
+	bool
+	BtrfsTreeObjectFetcher::hdr_match(const BtrfsIoctlSearchHeader &hdr)
+	{
+		// If you're calling this method without overriding it, you should have set type first
+		assert(m_type);
+		return hdr.type == m_type;
+	}
+
+	bool
+	BtrfsTreeObjectFetcher::hdr_stop(const BtrfsIoctlSearchHeader &hdr)
+	{
+		(void)hdr;
+		return false;
+	}
+
+	uint64_t
+	BtrfsTreeOffsetFetcher::hdr_logical(const BtrfsIoctlSearchHeader &hdr)
+	{
+		return hdr.offset;
+	}
+
+	bool
+	BtrfsTreeOffsetFetcher::hdr_match(const BtrfsIoctlSearchHeader &hdr)
+	{
+		assert(m_type);
+		return hdr.type == m_type && hdr.objectid == m_objectid;
+	}
+
+	bool
+	BtrfsTreeOffsetFetcher::hdr_stop(const BtrfsIoctlSearchHeader &hdr)
+	{
+		assert(m_type);
+		return hdr.objectid > m_objectid || hdr.type > m_type;
+	}
+
+	void
+	BtrfsTreeOffsetFetcher::objectid(uint64_t objectid)
+	{
+		m_objectid = objectid;
+	}
+
+	uint64_t
+	BtrfsTreeOffsetFetcher::objectid() const
+	{
+		return m_objectid;
+	}
+
+	void
+	BtrfsTreeOffsetFetcher::fill_sk(BtrfsIoctlSearchKey &sk, uint64_t offset)
+	{
+		BtrfsTreeFetcher::fill_sk(sk, offset);
+		sk.min_offset = offset;
+		sk.max_offset = numeric_limits<decltype(sk.max_offset)>::max();
+		sk.min_objectid = m_objectid;
+		sk.max_objectid = m_objectid;
+	}
+
+	void
+	BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
+	{
+	#if 0
+	#define BCTFGS_DEBUG(x) do { cerr << x; } while (false)
+	#else
+	#define BCTFGS_DEBUG(x) do { } while (false)
+	#endif
+		const uint64_t logical_end = logical + count * block_size();
+		BtrfsTreeItem bti = rlower_bound(logical);
+		size_t __attribute__((unused)) loops = 0;
+		BCTFGS_DEBUG("get_sums " << to_hex(logical) << ".." << to_hex(logical_end) << endl);
+		while (!!bti) {
+			BCTFGS_DEBUG("get_sums[" << loops << "]: " << bti << endl);
+			++loops;
+			// Reject wrong type or objectid
+			THROW_CHECK1(runtime_error, bti.type(), bti.type() == BTRFS_EXTENT_CSUM_KEY);
+			THROW_CHECK1(runtime_error, bti.objectid(), bti.objectid() == BTRFS_EXTENT_CSUM_OBJECTID);
+			// Is this object in range?
+			const uint64_t data_logical = bti.offset();
+			if (data_logical >= logical_end) {
+				// csum object is past end of range, we are done
+				return;
+			}
+			// Figure out how long this csum item is in various units
+			const size_t csum_byte_count = bti.data().size();
+			THROW_CHECK1(runtime_error, csum_byte_count, (csum_byte_count % m_sum_size) == 0);
+			THROW_CHECK1(runtime_error, csum_byte_count, csum_byte_count > 0);
+			const size_t csum_count = csum_byte_count / m_sum_size;
+			const uint64_t data_byte_count = csum_count * block_size();
+			const uint64_t data_logical_end = data_logical + data_byte_count;
+			if (data_logical_end <= logical) {
+				// too low, look at next item
+				bti = lower_bound(logical);
+				continue;
+			}
+			// There is some overlap?
+			const uint64_t overlap_begin = max(logical, data_logical);
+			const uint64_t overlap_end = min(logical_end, data_logical_end);
+			THROW_CHECK2(runtime_error, overlap_begin, overlap_end, overlap_begin < overlap_end);
+			const uint64_t overlap_offset = overlap_begin - data_logical;
+			THROW_CHECK1(runtime_error, overlap_offset, (overlap_offset % block_size()) == 0);
+			const uint64_t overlap_index = overlap_offset * m_sum_size / block_size();
+			const uint64_t overlap_byte_count = overlap_end - overlap_begin;
+			const uint64_t overlap_csum_byte_count = overlap_byte_count * m_sum_size / block_size();
+			// Can't be bigger than a btrfs item
+			THROW_CHECK1(runtime_error, overlap_index, overlap_index < 65536);
+			THROW_CHECK1(runtime_error, overlap_csum_byte_count, overlap_csum_byte_count < 65536);
+			// Yes, process the overlap
+			output(overlap_begin, bti.data().data() + overlap_index, overlap_csum_byte_count);
+			// Advance
+			bti = lower_bound(overlap_end);
+		}
+	#undef BCTFGS_DEBUG
+	}
+
+	uint32_t
+	BtrfsCsumTreeFetcher::sum_type() const
+	{
+		return m_sum_type;
+	}
+
+	size_t
+	BtrfsCsumTreeFetcher::sum_size() const
+	{
+		return m_sum_size;
+	}
+
+	BtrfsCsumTreeFetcher::BtrfsCsumTreeFetcher(const Fd &new_fd) :
+		BtrfsTreeOffsetFetcher(new_fd)
+	{
+		type(BTRFS_EXTENT_CSUM_KEY);
+		tree(BTRFS_CSUM_TREE_OBJECTID);
+		objectid(BTRFS_EXTENT_CSUM_OBJECTID);
+		BtrfsIoctlFsInfoArgs bifia;
+		bifia.do_ioctl(fd());
+		m_sum_type = static_cast<btrfs_compression_type>(bifia.csum_type());
+		m_sum_size = bifia.csum_size();
+		if (m_sum_type == BTRFS_CSUM_TYPE_CRC32 && m_sum_size == 0) {
+			// Older kernel versions don't fill in this field
+			m_sum_size = 4;
+		}
+		THROW_CHECK1(runtime_error, m_sum_size, m_sum_size > 0);
+	}
+
+	BtrfsExtentItemFetcher::BtrfsExtentItemFetcher(const Fd &new_fd) :
+		BtrfsTreeObjectFetcher(new_fd)
+	{
+		tree(BTRFS_EXTENT_TREE_OBJECTID);
+		type(BTRFS_EXTENT_ITEM_KEY);
+	}
+
+	BtrfsExtentDataFetcher::BtrfsExtentDataFetcher(const Fd &new_fd) :
+		BtrfsTreeOffsetFetcher(new_fd)
+	{
+		type(BTRFS_EXTENT_DATA_KEY);
+	}
+
+	BtrfsFsTreeFetcher::BtrfsFsTreeFetcher(const Fd &new_fd, uint64_t subvol) :
+		BtrfsTreeObjectFetcher(new_fd)
+	{
+		tree(subvol);
+		type(BTRFS_EXTENT_DATA_KEY);
+		scale_size(1);
+	}
+
+	BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
+		BtrfsTreeObjectFetcher(fd)
+	{
+		type(BTRFS_INODE_ITEM_KEY);
+		scale_size(1);
+	}
+
+	BtrfsTreeItem
+	BtrfsInodeFetcher::stat(uint64_t subvol, uint64_t inode)
+	{
+		tree(subvol);
+		const auto item = at(inode);
+		if (!!item) {
+			THROW_CHECK2(runtime_error, item.objectid(), inode, inode == item.objectid());
+			THROW_CHECK2(runtime_error, item.type(), BTRFS_INODE_ITEM_KEY, item.type() == BTRFS_INODE_ITEM_KEY);
+		}
+		return item;
+	}
+
+	BtrfsRootFetcher::BtrfsRootFetcher(const Fd &fd) :
+		BtrfsTreeObjectFetcher(fd)
+	{
+		tree(BTRFS_ROOT_TREE_OBJECTID);
+		type(BTRFS_ROOT_ITEM_KEY);
+		scale_size(1);
+	}
+
+	BtrfsTreeItem
+	BtrfsRootFetcher::root(uint64_t subvol)
+	{
+		const auto item = at(subvol);
+		if (!!item) {
+			THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
+			THROW_CHECK2(runtime_error, item.type(), BTRFS_ROOT_ITEM_KEY, item.type() == BTRFS_ROOT_ITEM_KEY);
+		}
+		return item;
+	}
+}
--- a/lib/bytevector.cc
+++ b/lib/bytevector.cc
@@ -1,6 +1,10 @@
 #include "crucible/bytevector.h"

 #include "crucible/error.h"
+#include "crucible/hexdump.h"
+#include "crucible/string.h"
+
+#include <cassert>

 namespace crucible {
 	using namespace std;
@@ -8,12 +12,14 @@ namespace crucible {
 	ByteVector::iterator
 	ByteVector::begin() const
 	{
+		unique_lock<mutex> lock(m_mutex);
 		return m_ptr.get();
 	}

 	ByteVector::iterator
 	ByteVector::end() const
 	{
+		unique_lock<mutex> lock(m_mutex);
 		return m_ptr.get() + m_size;
 	}

@@ -32,6 +38,7 @@ namespace crucible {
 	void
 	ByteVector::clear()
 	{
+		unique_lock<mutex> lock(m_mutex);
 		m_ptr.reset();
 		m_size = 0;
 	}
@@ -39,9 +46,32 @@ namespace crucible {
 	ByteVector::value_type&
 	ByteVector::operator[](size_t size) const
 	{
+		unique_lock<mutex> lock(m_mutex);
 		return m_ptr.get()[size];
 	}

+	ByteVector::ByteVector(const ByteVector &that)
+	{
+		unique_lock<mutex> lock(that.m_mutex);
+		m_ptr = that.m_ptr;
+		m_size = that.m_size;
+	}
+
+	ByteVector&
+	ByteVector::operator=(const ByteVector &that)
+	{
+		// If &that == this, there's no need to do anything, but
+		// especially don't try to lock the same mutex twice.
+		if (&m_mutex != &that.m_mutex) {
+			unique_lock<mutex> lock_this(m_mutex, defer_lock);
+			unique_lock<mutex> lock_that(that.m_mutex, defer_lock);
+			lock(lock_this, lock_that);
+			m_ptr = that.m_ptr;
+			m_size = that.m_size;
+		}
+		return *this;
+	}
+
 	ByteVector::ByteVector(const ByteVector &that, size_t start, size_t length)
 	{
 		THROW_CHECK0(out_of_range, that.m_ptr);
@@ -60,6 +90,7 @@ namespace crucible {
 	ByteVector::value_type&
 	ByteVector::at(size_t size) const
 	{
+		unique_lock<mutex> lock(m_mutex);
 		THROW_CHECK0(out_of_range, m_ptr);
 		THROW_CHECK2(out_of_range, size, m_size, size < m_size);
 		return m_ptr.get()[size];
@@ -98,6 +129,9 @@ namespace crucible {
 	bool
 	ByteVector::operator==(const ByteVector &that) const
 	{
+		unique_lock<mutex> lock_this(m_mutex, defer_lock);
+		unique_lock<mutex> lock_that(that.m_mutex, defer_lock);
+		lock(lock_this, lock_that);
 		if (!m_ptr) {
 			return !that.m_ptr;
 		}
@@ -116,6 +150,7 @@ namespace crucible {
 	void
 	ByteVector::erase(iterator begin, iterator end)
 	{
+		unique_lock<mutex> lock(m_mutex);
 		const size_t size = end - begin;
 		if (!size) return;
 		THROW_CHECK0(out_of_range, m_ptr);
@@ -142,6 +177,14 @@ namespace crucible {
 	ByteVector::value_type*
 	ByteVector::data() const
 	{
+		unique_lock<mutex> lock(m_mutex);
 		return m_ptr.get();
 	}
+
+	ostream&
+	operator<<(ostream &os, const ByteVector &bv) {
+		unique_lock<mutex> lock(bv.m_mutex);
+		hexdump(os, bv);
+		return os;
+	}
 }
--- a/lib/fd.cc
+++ b/lib/fd.cc
@@ -361,7 +361,10 @@ namespace crucible {
                        THROW_ERROR(invalid_argument, "pwrite: trying to write on a closed file descriptor");
                }
 		int rv = ::pwrite(fd, buf, size, offset);
-		if (rv != static_cast<int>(size)) {
+		if (rv < 0) {
+			THROW_ERRNO("pwrite: could not write " << size << " bytes at fd " << name_fd(fd) << " offset " << offset);
+		}
+		if (rv != static_cast<ssize_t>(size)) {
 			THROW_ERROR(runtime_error, "pwrite: only " << rv << " of " << size << " bytes written at fd " << name_fd(fd) << " offset " << offset);
 		}
 	}
@@ -392,7 +395,7 @@ namespace crucible {
 				}
 				THROW_ERRNO("read: " << size << " bytes");
 			}
-			if (rv > static_cast<int>(size)) {
+			if (rv > static_cast<ssize_t>(size)) {
 				THROW_ERROR(runtime_error, "read: somehow read more bytes (" << rv << ") than requested (" << size << ")");
 			}
 			if (rv == 0) break;
@@ -441,7 +444,7 @@ namespace crucible {
 					}
 					THROW_ERRNO("pread: " << size << " bytes");
 				}
-				if (rv != static_cast<int>(size)) {
+				if (rv != static_cast<ssize_t>(size)) {
 					THROW_ERROR(runtime_error, "pread: " << size << " bytes at fd " << name_fd(fd) << " offset " << offset << " returned " << rv);
 				}
 				break;
@@ -521,7 +524,14 @@ namespace crucible {
 	void
 	ioctl_iflags_set(int fd, int attr)
 	{
-		DIE_IF_MINUS_ONE(ioctl(fd, FS_IOC_SETFLAGS, &attr));
+		// This bit of nonsense brought to you by Valgrind.
+		union {
+			int attr;
+			long zero;
+		} u;
+		u.zero = 0;
+		u.attr = attr;
+		DIE_IF_MINUS_ONE(ioctl(fd, FS_IOC_SETFLAGS, &u.attr));
 	}

 	string
--- a/lib/fs.cc
+++ b/lib/fs.cc
@@ -2,6 +2,7 @@

 #include "crucible/error.h"
 #include "crucible/fd.h"
+#include "crucible/hexdump.h"
 #include "crucible/limits.h"
 #include "crucible/ntoa.h"
 #include "crucible/string.h"
@@ -32,18 +33,6 @@ namespace crucible {
 #endif
 	}

-	BtrfsExtentInfo::BtrfsExtentInfo(int dst_fd, off_t dst_offset) :
-		btrfs_ioctl_same_extent_info( (btrfs_ioctl_same_extent_info) { } )
-	{
-		assert(fd == 0);
-		assert(logical_offset == 0);
-		assert(bytes_deduped == 0);
-		assert(status == 0);
-		assert(reserved == 0);
-		fd = dst_fd;
-		logical_offset = dst_offset;
-	}
-
 	BtrfsExtentSame::BtrfsExtentSame(int src_fd, off_t src_offset, off_t src_length) :
 		m_logical_offset(src_offset),
 		m_length(src_length),
@@ -56,9 +45,12 @@ namespace crucible {
 	}

 	void
-	BtrfsExtentSame::add(int fd, off_t offset)
+	BtrfsExtentSame::add(int const fd, uint64_t const offset)
 	{
-		m_info.push_back(BtrfsExtentInfo(fd, offset));
+		m_info.push_back( (btrfs_ioctl_same_extent_info) {
+			.fd = fd,
+			.logical_offset = offset,
+		});
 	}

 	ostream &
@@ -251,7 +243,7 @@ namespace crucible {
 			return os << "BtrfsIoctlLogicalInoArgs NULL";
 		}
 		os << "BtrfsIoctlLogicalInoArgs {";
-		os << " .logical = " << to_hex(p->logical);
+		os << " .m_logical = " << to_hex(p->m_logical);
 		os << " .inodes[] = {\n";
 		unsigned count = 0;
 		for (auto i = p->m_iors.cbegin(); i != p->m_iors.cend(); ++i) {
@@ -262,14 +254,10 @@ namespace crucible {
 	}

 	BtrfsIoctlLogicalInoArgs::BtrfsIoctlLogicalInoArgs(uint64_t new_logical, size_t new_size) :
-		btrfs_ioctl_logical_ino_args( (btrfs_ioctl_logical_ino_args) { } ),
 		m_container_size(new_size),
-		m_container(new_size)
+		m_container(new_size),
+		m_logical(new_logical)
 	{
-		assert(logical == 0);
-		assert(size == 0);
-		assert(flags == 0);
-		logical = new_logical;
 	}

 	size_t
@@ -308,11 +296,6 @@ namespace crucible {
 		return m_begin;
 	}

-	BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::operator vector<BtrfsInodeOffsetRoot>() const
-	{
-		return vector<BtrfsInodeOffsetRoot>(m_begin, m_end);
-	}
-
 	void
 	BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::clear()
 	{
@@ -322,23 +305,40 @@ namespace crucible {
 	void
 	BtrfsIoctlLogicalInoArgs::set_flags(uint64_t new_flags)
 	{
-		// We are still supporting building with old headers that don't have .flags yet
-		*(&reserved[0] + 3) = new_flags;
+		m_flags = new_flags;
 	}

 	uint64_t
 	BtrfsIoctlLogicalInoArgs::get_flags() const
 	{
 		// We are still supporting building with old headers that don't have .flags yet
-		return *(&reserved[0] + 3);
+		return m_flags;
+	}
+
+	void
+	BtrfsIoctlLogicalInoArgs::set_logical(uint64_t new_logical)
+	{
+		m_logical = new_logical;
+	}
+
+	void
+	BtrfsIoctlLogicalInoArgs::set_size(uint64_t new_size)
+	{
+		m_container_size = new_size;
 	}

 	bool
 	BtrfsIoctlLogicalInoArgs::do_ioctl_nothrow(int fd)
 	{
-		btrfs_ioctl_logical_ino_args *const p = static_cast<btrfs_ioctl_logical_ino_args *>(this);
-		inodes = reinterpret_cast<uint64_t>(m_container.prepare(m_container_size));
-		size = m_container.get_size();
+		btrfs_ioctl_logical_ino_args args = (btrfs_ioctl_logical_ino_args) {
+			.logical = m_logical,
+			.size = m_container_size,
+			.inodes = reinterpret_cast<uint64_t>(m_container.prepare(m_container_size)),
+		};
+		// We are still supporting building with old headers that don't have .flags yet
+		*(&args.reserved[0] + 3) = m_flags;
+
+		btrfs_ioctl_logical_ino_args *const p = &args;

 		m_iors.clear();

@@ -376,12 +376,12 @@ namespace crucible {
 		}

 		btrfs_data_container *const bdc = reinterpret_cast<btrfs_data_container *>(p->inodes);
-		BtrfsInodeOffsetRoot *const input_iter = reinterpret_cast<BtrfsInodeOffsetRoot *>(bdc->val);
+		BtrfsInodeOffsetRoot *const ior_iter = reinterpret_cast<BtrfsInodeOffsetRoot *>(bdc->val);

 		// elem_cnt counts uint64_t, but BtrfsInodeOffsetRoot is 3x uint64_t
 		THROW_CHECK1(runtime_error, bdc->elem_cnt, bdc->elem_cnt % 3 == 0);
-		m_iors.m_begin = input_iter;
-		m_iors.m_end = input_iter + bdc->elem_cnt / 3;
+		m_iors.m_begin = ior_iter;
+		m_iors.m_end = ior_iter + bdc->elem_cnt / 3;
 		return true;
 	}

@@ -520,9 +520,10 @@ namespace crucible {
 	}

 	string
-	btrfs_ioctl_defrag_range_compress_type_ntoa(uint32_t compress_type)
+	btrfs_compress_type_ntoa(uint8_t compress_type)
 	{
 		static const bits_ntoa_table table[] = {
+			NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_NONE),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_ZLIB),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_LZO),
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_ZSTD),
@@ -542,7 +543,7 @@ namespace crucible {
 		os << " .len = " << p->len;
 		os << " .flags = " << btrfs_ioctl_defrag_range_flags_ntoa(p->flags);
 		os << " .extent_thresh = " << p->extent_thresh;
-		os << " .compress_type = " << btrfs_ioctl_defrag_range_compress_type_ntoa(p->compress_type);
+		os << " .compress_type = " << btrfs_compress_type_ntoa(p->compress_type);
 		os << " .unused[4] = { " << p->unused[0] << ", " << p->unused[1] << ", " << p->unused[2] << ", " << p->unused[3] << "} }";
 		return os;
 	}
@@ -865,30 +866,6 @@ namespace crucible {
 		}
 	}

-	template <class V>
-	ostream &
-	hexdump(ostream &os, const V &v)
-	{
-		os << "V { size = " << v.size() << ", data:\n";
-		for (size_t i = 0; i < v.size(); i += 8) {
-			string hex, ascii;
-			for (size_t j = i; j < i + 8; ++j) {
-				if (j < v.size()) {
-					uint8_t c = v[j];
-					char buf[8];
-					sprintf(buf, "%02x ", c);
-					hex += buf;
-					ascii += (c < 32 || c > 126) ? '.' : c;
-				} else {
-					hex += "   ";
-					ascii += ' ';
-				}
-			}
-			os << astringprintf("\t%08x %s %s\n", i, hex.c_str(), ascii.c_str());
-		}
-		return os << "}";
-	}
-
 	string
 	btrfs_search_type_ntoa(unsigned type)
 	{
--- a/lib/multilock.cc
+++ b/lib/multilock.cc
@@ -0,0 +1,72 @@
+#include "crucible/multilock.h"
+
+#include "crucible/error.h"
+
+namespace crucible {
+	using namespace std;
+
+	MultiLocker::LockHandle::LockHandle(const string &type, MultiLocker &parent) :
+		m_type(type),
+		m_parent(parent)
+	{
+	}
+
+	void
+	MultiLocker::LockHandle::set_locked(const bool state)
+	{
+		m_locked = state;
+	}
+
+	MultiLocker::LockHandle::~LockHandle()
+	{
+		if (m_locked) {
+			m_parent.put_lock(m_type);
+			m_locked = false;
+		}
+	}
+
+	bool
+	MultiLocker::is_lock_available(const string &type)
+	{
+		for (const auto &i : m_counters) {
+			if (i.second != 0 && i.first != type) {
+				return false;
+			}
+		}
+		return true;
+	}
+
+	void
+	MultiLocker::put_lock(const string &type)
+	{
+		unique_lock<mutex> lock(m_mutex);
+		auto &counter = m_counters[type];
+		THROW_CHECK2(runtime_error, type, counter, counter > 0);
+		--counter;
+		if (counter == 0) {
+			m_cv.notify_all();
+		}
+	}
+
+	shared_ptr<MultiLocker::LockHandle>
+	MultiLocker::get_lock_private(const string &type)
+	{
+		unique_lock<mutex> lock(m_mutex);
+		m_counters.insert(make_pair(type, size_t(0)));
+		while (!is_lock_available(type)) {
+			m_cv.wait(lock);
+		}
+		const auto rv = make_shared<LockHandle>(type, *this);
+		++m_counters[type];
+		rv->set_locked(true);
+		return rv;
+	}
+
+	shared_ptr<MultiLocker::LockHandle>
+	MultiLocker::get_lock(const string &type)
+	{
+		static MultiLocker s_process_instance;
+		return s_process_instance.get_lock_private(type);
+	}
+
+}
--- a/lib/task.cc
+++ b/lib/task.cc
@@ -18,6 +18,27 @@
 namespace crucible {
 	using namespace std;

+	static const size_t thread_name_length = 15; // TASK_COMM_LEN on Linux is 16, including the terminating NUL
+
+	void
+	pthread_setname(const string &name)
+	{
+		auto name_copy = name.substr(0, thread_name_length);
+		// Don't care if a debugging facility fails
+		pthread_setname_np(pthread_self(), name_copy.c_str());
+	}
+
+	string
+	pthread_getname()
+	{
+		char buf[thread_name_length + 1] = { 0 };
+		// We'll get an empty name if this fails...
+		pthread_getname_np(pthread_self(), buf, sizeof(buf));
+		// ...or at least null-terminated garbage
+		buf[thread_name_length] = '\0';
+		return buf;
+	}
+
 	class TaskState;
 	using TaskStatePtr = shared_ptr<TaskState>;
 	using TaskStateWeak = weak_ptr<TaskState>;
@@ -30,7 +51,8 @@ namespace crucible {

 	static thread_local TaskStatePtr tl_current_task;

-	/// because we don't want to bump -std=c++-17 just to get scoped_lock
+	/// because we don't want to bump -std=c++-17 just to get scoped_lock.
+	/// Also we don't want to self-deadlock if both mutexes are the same mutex.
 	class PairLock {
 		unique_lock<mutex>	m_lock1, m_lock2;
 	public:
@@ -54,8 +76,8 @@ namespace crucible {
 		/// Tasks to be executed after the current task is executed
 		list<TaskStatePtr>			m_post_exec_queue;

-		/// Incremented by run() and append().  Decremented by exec().
-		size_t					m_run_count = 0;
+		/// Set by run() and append().  Cleared by exec().
+		bool					m_run_now = false;

 		/// Set when task starts execution by exec().
 		/// Cleared when exec() ends.
@@ -137,6 +159,8 @@ namespace crucible {
 		size_t					m_configured_thread_max;
 		double					m_thread_target;
 		bool					m_cancelled = false;
+		bool					m_paused = false;
+		TaskMaster::LoadStats			m_load_stats;

 	friend class TaskConsumer;
 	friend class TaskMaster;
@@ -150,6 +174,7 @@ namespace crucible {
 		void set_loadavg_target(double target);
 		void loadavg_thread_fn();
 		void cancel();
+		void pause(bool paused = true);

 		TaskMasterState &operator=(const TaskMasterState &) = delete;
 		TaskMasterState(const TaskMasterState &) = delete;
@@ -162,6 +187,7 @@ namespace crucible {
 		static void push_front(TaskQueue &queue);
 		size_t get_queue_count();
 		size_t get_thread_count();
+		static TaskMaster::LoadStats get_current_load();
 	};

 	class TaskConsumer : public enable_shared_from_this<TaskConsumer> {
@@ -193,7 +219,7 @@ namespace crucible {
 		if (queue.empty()) {
 			return;
 		}
-		auto tlcc = tl_current_consumer;
+		const auto tlcc = tl_current_consumer;
 		if (tlcc) {
 			// We are executing under a TaskConsumer, splice our post-exec queue at front.
 			// No locks needed because we are using only thread-local objects.
@@ -218,6 +244,9 @@ namespace crucible {
 	TaskState::~TaskState()
 	{
 		--s_instance_count;
+		unique_lock<mutex> lock(m_mutex);
+		// If any dependent Tasks were appended since the last exec, run them now
+		TaskState::rescue_queue(m_post_exec_queue);
 	}

 	TaskState::TaskState(string title, function<void()> exec_fn) :
@@ -254,11 +283,10 @@ namespace crucible {
 	void
 	TaskState::clear_queue(TaskQueue &tq)
 	{
-		while (!tq.empty()) {
-			auto i = *tq.begin();
-			tq.pop_front();
+		for (auto &i : tq) {
 			i->clear();
 		}
+		tq.clear();
 	}

 	void
@@ -273,8 +301,8 @@ namespace crucible {
 	{
 		THROW_CHECK0(invalid_argument, task);
 		PairLock lock(m_mutex, task->m_mutex);
-		if (!task->m_run_count) {
-			++task->m_run_count;
+		if (!task->m_run_now) {
+			task->m_run_now = true;
 			append_nolock(task);
 		}
 	}
@@ -290,7 +318,7 @@ namespace crucible {
 			append_nolock(shared_from_this());
 			return;
 		} else {
-			--m_run_count;
+			m_run_now = false;
 			m_is_running = true;
 		}

@@ -298,15 +326,14 @@ namespace crucible {
 		swap(this_task, tl_current_task);
 		lock.unlock();

-		char buf[24] = { 0 };
-		DIE_IF_MINUS_ERRNO(pthread_getname_np(pthread_self(), buf, sizeof(buf)));
-		DIE_IF_MINUS_ERRNO(pthread_setname_np(pthread_self(), m_title.c_str()));
+		const auto old_thread_name = pthread_getname();
+		pthread_setname(m_title);

 		catch_all([&]() {
 			m_exec_fn();
 		});

-		pthread_setname_np(pthread_self(), buf);
+		pthread_setname(old_thread_name);

 		lock.lock();
 		swap(this_task, tl_current_task);
@@ -333,24 +360,25 @@ namespace crucible {
 	TaskState::run()
 	{
 		unique_lock<mutex> lock(m_mutex);
-		if (m_run_count) {
+		if (m_run_now) {
 			return;
 		}
-		++m_run_count;
+		m_run_now = true;
 		TaskMasterState::push_back(shared_from_this());
 	}

 	TaskMasterState::TaskMasterState(size_t thread_max) :
 		m_thread_max(thread_max),
 		m_configured_thread_max(thread_max),
-		m_thread_target(thread_max)
+		m_thread_target(thread_max),
+		m_load_stats(TaskMaster::LoadStats { 0 })
 	{
 	}

 	void
 	TaskMasterState::start_threads_nolock()
 	{
-		while (m_threads.size() < m_thread_max) {
+		while (m_threads.size() < m_thread_max && !m_paused) {
 			m_threads.insert(make_shared<TaskConsumer>(shared_from_this()));
 		}
 	}
@@ -417,6 +445,13 @@ namespace crucible {
 		return s_tms->m_threads.size();
 	}

+	TaskMaster::LoadStats
+	TaskMaster::get_current_load()
+	{
+		unique_lock<mutex> lock(s_tms->m_mutex);
+		return s_tms->m_load_stats;
+	}
+
 	ostream &
 	TaskMaster::print_queue(ostream &os)
 	{
@@ -451,8 +486,8 @@ namespace crucible {
 	size_t
 	TaskMasterState::calculate_thread_count_nolock()
 	{
-		if (m_cancelled) {
-			// No threads running while cancelled
+		if (m_paused) {
+			// No threads running while paused or cancelled
 			return 0;
 		}

@@ -484,19 +519,21 @@ namespace crucible {

 		m_prev_loadavg = loadavg;

-		// Change the thread target based on the
-		// difference between current and desired load
-		// but don't get too close all at once due to rounding and sample error.
-		// If m_load_target < 1.0 then we are just doing PWM with one thread.
-
-		if (m_load_target <= 1.0) {
-			m_thread_target = 1.0;
-		} else if (m_load_target - current_load >= 1.0) {
-			m_thread_target += (m_load_target - current_load - 1.0) / 2.0;
-		} else if (m_load_target < current_load) {
-			m_thread_target += m_load_target - current_load;
+		const double load_deficit = m_load_target - loadavg;
+		if (load_deficit > 0) {
+			// Load is too low, raise the thread target by a third of the deficit
+			m_thread_target += load_deficit / 3;
+		} else if (load_deficit < 0) {
+			// Load is too high, lower the thread target by the full excess
+			m_thread_target += load_deficit;
 		}

+		m_load_stats = TaskMaster::LoadStats {
+			.current_load = current_load,
+			.thread_target = m_thread_target,
+			.loadavg = loadavg,
+		};
+
 		// Cannot exceed configured maximum thread count or less than zero
 		m_thread_target = min(max(0.0, m_thread_target), double(m_configured_thread_max));

@@ -526,12 +563,6 @@ namespace crucible {
 	TaskMasterState::set_thread_count(size_t thread_max)
 	{
 		unique_lock<mutex> lock(m_mutex);
-		// XXX: someday we might want to uncancel, and this would be the place to do it;
-		// however, when we cancel we destroy the entire Task queue, and that might be
-		// non-trivial to recover from
-		if (m_cancelled) {
-			return;
-		}
 		m_configured_thread_max = thread_max;
 		lock.unlock();
 		adjust_thread_count();
@@ -548,6 +579,7 @@ namespace crucible {
 	TaskMasterState::cancel()
 	{
 		unique_lock<mutex> lock(m_mutex);
+		m_paused = true;
 		m_cancelled = true;
 		decltype(m_queue) empty_queue;
 		m_queue.swap(empty_queue);
@@ -562,14 +594,25 @@ namespace crucible {
 		s_tms->cancel();
 	}

+	void
+	TaskMasterState::pause(const bool paused)
+	{
+		unique_lock<mutex> lock(m_mutex);
+		m_paused = paused;
+		m_condvar.notify_all();
+		lock.unlock();
+	}
+
+	void
+	TaskMaster::pause(const bool paused)
+	{
+		s_tms->pause(paused);
+	}
+
 	void
 	TaskMasterState::set_thread_min_count(size_t thread_min)
 	{
 		unique_lock<mutex> lock(m_mutex);
-		// XXX: someday we might want to uncancel, and this would be the place to do it
-		if (m_cancelled) {
-			return;
-		}
 		m_thread_min = thread_min;
 		lock.unlock();
 		adjust_thread_count();
@@ -585,7 +628,7 @@ namespace crucible {
 	void
 	TaskMasterState::loadavg_thread_fn()
 	{
-		pthread_setname_np(pthread_self(), "load_tracker");
+		pthread_setname("load_tracker");
 		while (!m_cancelled) {
 			adjust_thread_count();
 			nanosleep(5.0);
@@ -701,7 +744,7 @@ namespace crucible {
 	TaskConsumer::consumer_thread()
 	{
 		// Keep a copy because we will be destroying *this later
-		auto master_copy = m_master;
+		const auto master_copy = m_master;

 		// Constructor is running with master locked.
 		// Wait until that is done before trying to do anything.
@@ -711,13 +754,13 @@ namespace crucible {
 		m_thread->detach();

 		// Set thread name so it isn't empty or the name of some other thread
-		DIE_IF_MINUS_ERRNO(pthread_setname_np(pthread_self(), "task_consumer"));
+		pthread_setname("task_consumer");

 		// It is now safe to access our own shared_ptr
 		TaskConsumerPtr this_consumer = shared_from_this();
 		swap(this_consumer, tl_current_consumer);

-		while (!master_copy->m_cancelled) {
+		while (!master_copy->m_paused) {
 			if (master_copy->m_thread_max < master_copy->m_threads.size()) {
 				// We are one of too many threads, exit now
 				break;
@@ -788,24 +831,16 @@ namespace crucible {
 		void insert_task(Task t);
 	};

-	Barrier::Barrier(shared_ptr<BarrierState> pbs) :
-		m_barrier_state(pbs)
-	{
-	}
-
-	Barrier::Barrier() :
-		m_barrier_state(make_shared<BarrierState>())
-	{
-	}
-
 	void
 	BarrierState::release()
 	{
+		set<Task> tasks_local;
 		unique_lock<mutex> lock(m_mutex);
-		for (auto i : m_tasks) {
+		swap(tasks_local, m_tasks);
+		lock.unlock();
+		for (const auto &i : tasks_local) {
 			i.run();
 		}
-		m_tasks.clear();
 	}

 	BarrierState::~BarrierState()
@@ -813,17 +848,6 @@ namespace crucible {
 		release();
 	}

-	BarrierLock::BarrierLock(shared_ptr<BarrierState> pbs) :
-		m_barrier_state(pbs)
-	{
-	}
-
-	void
-	BarrierLock::release()
-	{
-		m_barrier_state.reset();
-	}
-
 	void
 	BarrierState::insert_task(Task t)
 	{
@@ -831,122 +855,69 @@ namespace crucible {
 		m_tasks.insert(t);
 	}

+	Barrier::Barrier() :
+		m_barrier_state(make_shared<BarrierState>())
+	{
+	}
+
 	void
 	Barrier::insert_task(Task t)
 	{
 		m_barrier_state->insert_task(t);
 	}

-	BarrierLock
-	Barrier::lock()
-	{
-		return BarrierLock(m_barrier_state);
-	}
-
-	class ExclusionState {
-		mutex		m_mutex;
-		bool		m_locked = false;
-		Task		m_task;
-
-	public:
-		ExclusionState(const string &title);
-		~ExclusionState();
-		void release();
-		bool try_lock();
-		void insert_task(Task t);
-	};
-
-	Exclusion::Exclusion(shared_ptr<ExclusionState> pbs) :
-		m_exclusion_state(pbs)
-	{
-	}
-
-	Exclusion::Exclusion(const string &title) :
-		m_exclusion_state(make_shared<ExclusionState>(title))
-	{
-	}
-
-	ExclusionState::ExclusionState(const string &title) :
-		m_task(title, [](){})
-	{
-	}
-
 	void
-	ExclusionState::release()
+	Barrier::release()
 	{
-		unique_lock<mutex> lock(m_mutex);
-		m_locked = false;
-		m_task.run();
+		m_barrier_state.reset();
 	}

-	ExclusionState::~ExclusionState()
-	{
-		release();
-	}
-
-	ExclusionLock::ExclusionLock(shared_ptr<ExclusionState> pbs) :
-		m_exclusion_state(pbs)
+	ExclusionLock::ExclusionLock(shared_ptr<Task> owner) :
+		m_owner(owner)
 	{
 	}

 	void
 	ExclusionLock::release()
 	{
-		if (m_exclusion_state) {
-			m_exclusion_state->release();
-			m_exclusion_state.reset();
-		}
-	}
-
-	ExclusionLock::~ExclusionLock()
-	{
-		release();
+		m_owner.reset();
 	}

 	void
-	ExclusionState::insert_task(Task task)
+	Exclusion::insert_task(const Task &task)
 	{
 		unique_lock<mutex> lock(m_mutex);
-		if (m_locked) {
+		const auto sp = m_owner.lock();
+		lock.unlock();
+		if (sp) {
 			// If Exclusion is locked then queue task for release;
-			m_task.append(task);
+			sp->append(task);
 		} else {
 			// otherwise, run the inserted task immediately
 			task.run();
 		}
 	}

-	bool
-	ExclusionState::try_lock()
+	ExclusionLock
+	Exclusion::try_lock(const Task &task)
 	{
 		unique_lock<mutex> lock(m_mutex);
-		if (m_locked) {
-			return false;
+		const auto sp = m_owner.lock();
+		if (sp) {
+			if (task) {
+				sp->append(task);
+			}
+			return ExclusionLock();
 		} else {
-			m_locked = true;
-			return true;
+			const auto rv = make_shared<Task>(task);
+			m_owner = rv;
+			return ExclusionLock(rv);
 		}
 	}

-	void
-	Exclusion::insert_task(Task t)
-	{
-		m_exclusion_state->insert_task(t);
-	}
-
 	ExclusionLock::operator bool() const
 	{
-		return !!m_exclusion_state;
+		return !!m_owner;
 	}

-	ExclusionLock
-	Exclusion::try_lock()
-	{
-		THROW_CHECK0(runtime_error, m_exclusion_state);
-		if (m_exclusion_state->try_lock()) {
-			return ExclusionLock(m_exclusion_state);
-		} else {
-			return ExclusionLock();
-		}
-	}
 }
--- a/scripts/beesd.in
+++ b/scripts/beesd.in
@@ -15,7 +15,7 @@ readonly AL128K="$((128*1024))"
 readonly AL16M="$((16*1024*1024))"
 readonly CONFIG_DIR=@ETC_PREFIX@/bees/

-readonly bees_bin=$(realpath @LIBEXEC_PREFIX@/bees)
+readonly bees_bin=$(realpath @DESTDIR@/@LIBEXEC_PREFIX@/bees)

 command -v "$bees_bin" &> /dev/null || ERRO "Missing 'bees' agent"

@@ -128,7 +128,7 @@ fi
    fi
    if (( "$OLD_SIZE" != "$NEW_SIZE" )); then
        INFO "Resize db: $OLD_SIZE -> $NEW_SIZE"
-        [ -f "$BEESHOME/beescrawl.$UUID.dat" ] && rm "$BEESHOME/beescrawl.$UUID.dat"
+        rm -f "$BEESHOME/beescrawl.dat"
        truncate -s $NEW_SIZE $DB_PATH
    fi
    chmod 700 "$DB_PATH"
--- a/src/Makefile
+++ b/src/Makefile
@@ -1,11 +1,6 @@
 BEES = ../bin/bees
-PROGRAMS = \
-	../bin/fiemap \
-	../bin/fiewalk \

-PROGRAM_OBJS = $(foreach b,$(PROGRAMS),$(patsubst ../bin/%,%.o,$(b)))
-
-all: $(BEES) $(PROGRAMS)
+all: $(BEES)

 include ../makeflags
 -include ../localconf
@@ -25,25 +20,18 @@ BEES_OBJS = \

 ALL_OBJS = $(BEES_OBJS) $(PROGRAM_OBJS)

-bees-version.c: bees.h $(BEES_OBJS:.o=.cc) Makefile
-	echo "const char *BEES_VERSION = \"$(BEES_VERSION)\";" > bees-version.new.c
-	mv -f bees-version.new.c bees-version.c
+bees-version.c: bees.h $(BEES_OBJS:.o=.cc) Makefile ../lib/libcrucible.a
+	echo "const char *BEES_VERSION = \"$(BEES_VERSION)\";" > bees-version.c.new
+	if ! [ -e "$@" ] || ! cmp -s "$@.new" "$@"; then mv -fv $@.new $@; fi

 bees-usage.c: bees-usage.txt Makefile
 	(echo 'const char *BEES_USAGE = '; sed -r 's/^(.*)$$/"\1\\n"/' < bees-usage.txt; echo ';') > bees-usage.new.c
 	mv -f bees-usage.new.c bees-usage.c

-.depends:
-	mkdir -p $@
-
-.depends/%.dep: %.cc Makefile | .depends
+%.dep: %.cc Makefile
 	$(CXX) $(BEES_CXXFLAGS) -M -MF $@ -MT $(<:.cc=.o) $<

-depends.mk: $(ALL_OBJS:%.o=.depends/%.dep)
-	cat $^ > $@.new
-	mv -f $@.new $@
-
-include depends.mk
+include $(ALL_OBJS:%.o=%.dep)

 %.o: %.c ../makeflags
 	$(CC) $(BEES_CFLAGS) -o $@ -c $<
@@ -51,11 +39,6 @@ include depends.mk
 %.o: %.cc ../makeflags
 	$(CXX) $(BEES_CXXFLAGS) -o $@ -c $<

-$(PROGRAMS): ../bin/%: %.o
-	$(CXX) $(BEES_CXXFLAGS) $(BEES_LDFLAGS) -o $@ $< $(LIBS)
-
-$(PROGRAMS): ../lib/libcrucible.a
-
 $(BEES): $(BEES_OBJS) bees-version.o bees-usage.o ../lib/libcrucible.a
 	$(CXX) $(BEES_CXXFLAGS) $(BEES_LDFLAGS) -o $@ $^ $(LIBS)

--- a/src/bees-context.cc
+++ b/src/bees-context.cc
@@ -43,12 +43,13 @@ BeesFdCache::BeesFdCache(shared_ptr<BeesContext> ctx) :
 void
 BeesFdCache::clear()
 {
-	BEESNOTE("Clearing root FD cache to enable subvol delete");
-	BEESLOGDEBUG("Clearing root FD cache to enable subvol delete");
+	BEESLOGDEBUG("Clearing root FD cache with size " << m_root_cache.size() << " to enable subvol delete");
+	BEESNOTE("Clearing root FD cache with size " << m_root_cache.size());
 	m_root_cache.clear();
 	BEESCOUNT(root_clear);
-	BEESLOGDEBUG("Clearing open FD cache to enable file delete");
-	BEESNOTE("Clearing open FD cache to enable file delete");
+
+	BEESLOGDEBUG("Clearing open FD cache with size " << m_file_cache.size() << " to enable file delete");
+	BEESNOTE("Clearing open FD cache with size " << m_file_cache.size());
 	m_file_cache.clear();
 	BEESCOUNT(open_clear);
 }
@@ -84,11 +85,11 @@ BeesContext::dump_status()
 		ofs << "RATES:\n";
 		ofs << "\t" << avg_rates << "\n";

-		ofs << "THREADS (work queue " << TaskMaster::get_queue_count() << " of " << Task::instance_count() << " tasks, " << TaskMaster::get_thread_count() << " workers):\n";
+		const auto load_stats = TaskMaster::get_current_load();
+		ofs << "THREADS (work queue " << TaskMaster::get_queue_count() << " of " << Task::instance_count() << " tasks, " << TaskMaster::get_thread_count() << " workers, load: current " << load_stats.current_load << " target " << load_stats.thread_target << " average " << load_stats.loadavg << "):\n";
 		for (auto t : BeesNote::get_status()) {
 			ofs << "\ttid " << t.first << ": " << t.second << "\n";
 		}
-
 #if 0
 		// Huge amount of data, not a lot of information (yet)
 		ofs << "WORKERS:\n";
@@ -152,8 +153,8 @@ BeesContext::show_progress()
 		BEESLOGINFO("\t" << deltaRates);

 		BEESNOTE("logging current thread status");
-		BEESLOGINFO("THREADS:");
-
+		const auto load_stats = TaskMaster::get_current_load();
+		BEESLOGINFO("THREADS (work queue " << TaskMaster::get_queue_count() << " of " << Task::instance_count() << " tasks, " << TaskMaster::get_thread_count() << " workers, load: current " << load_stats.current_load << " target " << load_stats.thread_target << " average " << load_stats.loadavg << "):");
 		for (auto t : BeesNote::get_status()) {
 			BEESLOGINFO("\ttid " << t.first << ": " << t.second);
 		}
@@ -207,11 +208,6 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 	BeesAddress first_addr(brp.first.fd(), brp.first.begin());
 	BeesAddress second_addr(brp.second.fd(), brp.second.begin());

-	BEESLOGINFO("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
-		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
-	BEESNOTE("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
-		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
-
 	if (first_addr.get_physical_or_zero() == second_addr.get_physical_or_zero()) {
 		BEESLOGTRACE("equal physical addresses in dedup");
 		BEESCOUNT(bug_dedup_same_physical);
@@ -221,8 +217,18 @@ BeesContext::dedup(const BeesRangePair &brp_in)
 	THROW_CHECK1(invalid_argument, brp, brp.first.size() == brp.second.size());

 	BEESCOUNT(dedup_try);
+
+	BEESNOTE("waiting to dedup " << brp);
+	const auto lock = MultiLocker::get_lock("dedupe");
+
 	Timer dedup_timer;
-	bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
+
+	BEESLOGINFO("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
+		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
+	BEESNOTE("dedup: src " << pretty(brp.first.size())  << " [" << to_hex(brp.first.begin())  << ".." << to_hex(brp.first.end())  << "] {" << first_addr  << "} " << name_fd(brp.first.fd()) << "\n"
+		 << "       dst " << pretty(brp.second.size()) << " [" << to_hex(brp.second.begin()) << ".." << to_hex(brp.second.end()) << "] {" << second_addr << "} " << name_fd(brp.second.fd()));
+
+	const bool rv = btrfs_extent_same(brp.first.fd(), brp.first.begin(), brp.first.size(), brp.second.fd(), brp.second.begin());
 	BEESCOUNTADD(dedup_ms, dedup_timer.age() * 1000);

 	if (rv) {
@@ -328,29 +334,23 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 	if (e.flags() & Extent::PREALLOC) {
 		// Prealloc is all zero and we replace it with a hole.
 		// No special handling is required here.  Nuke it and move on.
-		Task(
-			"dedup_prealloc",
-			[m_ctx, bfr, e]() {
-				BEESLOGINFO("prealloc extent " << e);
-				// Must not extend past EOF
-				auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
-				// Must hold tmpfile until dedupe is done
-				auto tmpfile = m_ctx->tmpfile();
-				BeesFileRange prealloc_bfr(tmpfile->make_hole(extent_size));
-				// Apparently they can both extend past EOF
-				BeesFileRange copy_bfr(bfr.fd(), e.begin(), e.begin() + extent_size);
-				BeesRangePair brp(prealloc_bfr, copy_bfr);
-				// Raw dedupe here - nothing else to do with this extent, nothing to merge with
-				if (m_ctx->dedup(brp)) {
-					BEESCOUNT(dedup_prealloc_hit);
-					BEESCOUNTADD(dedup_prealloc_bytes, e.size());
-					// return bfr;
-				} else {
-					BEESCOUNT(dedup_prealloc_miss);
-				}
-			}
-		).run();
-		return bfr; // if dedupe success, which we now blindly assume
+		BEESLOGINFO("prealloc extent " << e);
+		// Must not extend past EOF
+		auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
+		// Must hold tmpfile until dedupe is done
+		const auto tmpfile = m_ctx->tmpfile();
+		BeesFileRange prealloc_bfr(tmpfile->make_hole(extent_size));
+		// Apparently they can both extend past EOF
+		BeesFileRange copy_bfr(bfr.fd(), e.begin(), e.begin() + extent_size);
+		BeesRangePair brp(prealloc_bfr, copy_bfr);
+		// Raw dedupe here - nothing else to do with this extent, nothing to merge with
+		if (m_ctx->dedup(brp)) {
+			BEESCOUNT(dedup_prealloc_hit);
+			BEESCOUNTADD(dedup_prealloc_bytes, e.size());
+			return bfr;
+		} else {
+			BEESCOUNT(dedup_prealloc_miss);
+		}
 	}

 	// OK we need to read extent now
@@ -602,57 +602,6 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 		BEESCOUNT(scan_zero_compressed);
 	}

-	// Turning this off because it's a waste of time on small extents
-	// and it's incorrect for large extents.
-#if 0
-	// If the extent contains obscured blocks, and we can find no
-	// other refs to the extent that reveal those blocks, nuke the incoming extent.
-	// Don't rewrite extents that are bigger than the maximum FILE_EXTENT_SAME size
-	// because we can't make extents that large with dedupe.
-	// Don't rewrite small extents because it is a waste of time without being
-	// able to combine them into bigger extents.
-	if (!rewrite_extent && (e.flags() & Extent::OBSCURED) && (e.physical_len() > BLOCK_SIZE_MAX_COMPRESSED_EXTENT) && (e.physical_len() < BLOCK_SIZE_MAX_EXTENT_SAME)) {
-		BEESCOUNT(scan_obscured);
-		BEESNOTE("obscured extent " << e);
-		// We have to map all the source blocks to see if any of them
-		// (or all of them aggregated) provide a path through the FS to the blocks
-		BeesResolver br(m_ctx, BeesAddress(e, e.begin()));
-		BeesBlockData ref_bbd(bfr.fd(), bfr.begin(), min(BLOCK_SIZE_SUMS, bfr.size()));
-		// BEESLOG("ref_bbd " << ref_bbd);
-		auto bfr_set = br.find_all_matches(ref_bbd);
-		bool non_obscured_extent_found = false;
-		set<off_t> blocks_to_find;
-		for (off_t j = 0; j < e.physical_len(); j += BLOCK_SIZE_CLONE) {
-			blocks_to_find.insert(j);
-		}
-		// Don't bother if saving less than 1%
-		auto maximum_hidden_count = blocks_to_find.size() / 100;
-		for (auto i : bfr_set) {
-			BtrfsExtentWalker ref_ew(bfr.fd(), bfr.begin(), m_ctx->root_fd());
-			Extent ref_e = ref_ew.current();
-			// BEESLOG("\tref_e " << ref_e);
-			THROW_CHECK2(out_of_range, ref_e, e, ref_e.offset() + ref_e.logical_len() <= e.physical_len());
-			for (off_t j = ref_e.offset(); j < ref_e.offset() + ref_e.logical_len(); j += BLOCK_SIZE_CLONE) {
-				blocks_to_find.erase(j);
-			}
-			if (blocks_to_find.size() <= maximum_hidden_count) {
-				BEESCOUNT(scan_obscured_miss);
-				BEESLOG("Found references to all but " << blocks_to_find.size() << " blocks");
-				non_obscured_extent_found = true;
-				break;
-			} else {
-				BEESCOUNT(scan_obscured_hit);
-				// BEESLOG("blocks_to_find: " << blocks_to_find.size() << " from " << *blocks_to_find.begin() << ".." << *blocks_to_find.rbegin());
-			}
-		}
-		if (!non_obscured_extent_found) {
-			// BEESLOG("No non-obscured extents found");
-			rewrite_extent = true;
-			BEESCOUNT(scan_obscured_rewrite);
-		}
-	}
-#endif
-
 	// If we deduped any blocks then we must rewrite the remainder of the extent
 	if (!noinsert_set.empty()) {
 		rewrite_extent = true;
@@ -724,7 +673,13 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
 	return bfr;
 }

-BeesFileRange
+shared_ptr<Exclusion>
+BeesContext::get_inode_mutex(const uint64_t inode)
+{
+	return m_inode_locks(inode);
+}
+
+bool
 BeesContext::scan_forward(const BeesFileRange &bfr_in)
 {
 	BEESTRACE("scan_forward " << bfr_in);
@@ -735,7 +690,7 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
 	// Silently filter out blacklisted files
 	if (is_blacklisted(bfr_in.fid())) {
 		BEESCOUNT(scan_blacklisted);
-		return bfr_in;
+		return false;
 	}

 	// Reconstitute FD
@@ -749,31 +704,35 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
 	if (!bfr.fd()) {
 		// BEESLOGINFO("No FD in " << root_path() << " for " << bfr);
 		BEESCOUNT(scan_no_fd);
-		return bfr;
+		return false;
 	}

 	// Sanity check
 	if (bfr.begin() >= bfr.file_size()) {
 		BEESLOGWARN("past EOF: " << bfr);
 		BEESCOUNT(scan_eof);
-		return bfr;
+		return false;
 	}

 	BtrfsExtentWalker ew(bfr.fd(), bfr.begin(), root_fd());

-	BeesFileRange return_bfr(bfr);
-
 	Extent e;
+	bool start_over = false;
 	catch_all([&]() {
-		while (!stop_requested()) {
+		while (!stop_requested() && !start_over) {
 			e = ew.current();

 			catch_all([&]() {
 				uint64_t extent_bytenr = e.bytenr();
-				BEESNOTE("waiting for extent bytenr " << to_hex(extent_bytenr));
-				auto extent_lock = m_extent_lock_set.make_lock(extent_bytenr);
+				auto extent_mutex = m_extent_locks(extent_bytenr);
+				const auto extent_lock = extent_mutex->try_lock(Task::current_task());
+				if (!extent_lock) {
+					// BEESLOGDEBUG("Deferring extent bytenr " << to_hex(extent_bytenr) << " from " << bfr);
+					BEESCOUNT(scanf_deferred_extent);
+					start_over = true;
+				}
 				Timer one_extent_timer;
-				return_bfr = scan_one_extent(bfr, e);
+				scan_one_extent(bfr, e);
 				BEESCOUNTADD(scanf_extent_ms, one_extent_timer.age() * 1000);
 				BEESCOUNT(scanf_extent);
 			});
@@ -791,48 +750,20 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
 	BEESCOUNTADD(scanf_total_ms, scan_timer.age() * 1000);
 	BEESCOUNT(scanf_total);

-	return return_bfr;
+	return start_over;
 }

 BeesResolveAddrResult::BeesResolveAddrResult()
 {
 }

-void
-BeesContext::wait_for_balance()
+shared_ptr<BtrfsIoctlLogicalInoArgs>
+BeesContext::logical_ino(const uint64_t logical, const bool all_refs)
 {
-	if (!BEES_SERIALIZE_BALANCE) {
-		return;
-	}
-
-	Timer balance_timer;
-	BEESNOTE("WORKAROUND: waiting for balance to stop");
-	while (true) {
-		btrfs_ioctl_balance_args args {};
-		const int ret = ioctl(root_fd(), BTRFS_IOC_BALANCE_PROGRESS, &args);
-		if (ret < 0) {
-			// Either can't get balance status or not running, exit either way
-			break;
-		}
-
-		if (!(args.state & BTRFS_BALANCE_STATE_RUNNING)) {
-			// Balance not running, doesn't matter if paused or cancelled
-			break;
-		}
-
-		BEESLOGDEBUG("WORKAROUND: Waiting " << balance_timer << "s for balance to stop");
-		unique_lock<mutex> lock(m_abort_mutex);
-		if (m_abort_requested) {
-			// Force the calling function to stop.	We cannot
-			// proceed to LOGICAL_INO while balance is running
-			// until the bugs are fixed, and it's probably
-			// not going to be particularly fast to have
-			// both bees and balance banging the disk anyway.
-			BeesTracer::set_silent();
-			throw std::runtime_error("Stop requested while balance running");
-		}
-		m_abort_condvar.wait_for(lock, chrono::duration<double>(BEES_BALANCE_POLL_INTERVAL));
-	}
+	const auto rv = m_logical_ino_pool();
+	rv->set_logical(logical);
+	rv->set_flags(all_refs ? BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET : 0);
+	return rv;
 }

 BeesResolveAddrResult
@@ -846,19 +777,22 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 	// transaction latency, competing threads, and freeze/SIGSTOP
 	// pausing the bees process.

-	// Wait for the balance to finish before we run LOGICAL_INO
-	wait_for_balance();
+	const auto log_ino_ptr = logical_ino(addr.get_physical_or_zero(), false);
+	auto &log_ino = *log_ino_ptr;

 	// Time how long this takes
 	Timer resolve_timer;

-        BtrfsIoctlLogicalInoArgs log_ino(addr.get_physical_or_zero());
-
-	// Get this thread's system CPU usage
 	struct rusage usage_before;
-	DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_before));
-
 	{
+		BEESNOTE("waiting to resolve addr " << addr << " with LOGICAL_INO");
+		const auto lock = MultiLocker::get_lock("logical_ino");
+
+		// Get this thread's system CPU usage
+		DIE_IF_MINUS_ONE(getrusage(RUSAGE_THREAD, &usage_before));
+
+		// Restart timer now that we're no longer waiting for lock
+		resolve_timer.reset();
 		BEESTOOLONG("Resolving addr " << addr << " in " << root_path() << " refs " << log_ino.m_iors.size());
 		BEESNOTE("resolving addr " << addr << " with LOGICAL_INO");
 		if (log_ino.do_ioctl_nothrow(root_fd())) {
@@ -887,8 +821,12 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)

 	// Avoid performance problems - pretend resolve failed if there are too many refs
 	const size_t rv_count = log_ino.m_iors.size();
+	if (!rv_count) {
+		BEESLOGDEBUG("LOGICAL_INO returned 0 refs at " << to_hex(addr));
+		BEESCOUNT(resolve_empty);
+	}
 	if (rv_count < BEES_MAX_EXTENT_REF_COUNT) {
-		rv.m_biors = log_ino.m_iors;
+		rv.m_biors = vector<BtrfsInodeOffsetRoot>(log_ino.m_iors.begin(), log_ino.m_iors.end());
 	} else {
 		BEESLOGINFO("addr " << addr << " refs " << rv_count << " overflows configured ref limit " << BEES_MAX_EXTENT_REF_COUNT);
 		BEESCOUNT(resolve_overflow);
@@ -898,7 +836,7 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 	if (sys_usage_delta < BEES_TOXIC_SYS_DURATION) {
 		rv.m_is_toxic = false;
 	} else {
-		BEESLOGNOTICE("WORKAROUND: toxic address: addr = " << addr << ", sys_usage_delta = " << round(sys_usage_delta* 1000.0) / 1000.0 << ", user_usage_delta = " << round(user_usage_delta * 1000.0) / 1000.0 << ", rt_age = " << rt_age << ", refs " << rv_count);
+		BEESLOGDEBUG("WORKAROUND: toxic address: addr = " << addr << ", sys_usage_delta = " << round(sys_usage_delta* 1000.0) / 1000.0 << ", user_usage_delta = " << round(user_usage_delta * 1000.0) / 1000.0 << ", rt_age = " << rt_age << ", refs " << rv_count);
 		BEESCOUNT(resolve_toxic);
 		rv.m_is_toxic = true;
 	}
@@ -931,6 +869,14 @@ BeesContext::invalidate_addr(BeesAddress addr)
 	return m_resolve_cache.expire(addr.get_physical_or_zero());
 }

+void
+BeesContext::resolve_cache_clear()
+{
+	BEESNOTE("clearing resolve cache with size " << m_resolve_cache.size());
+	BEESLOGDEBUG("Clearing resolve cache with size " << m_resolve_cache.size());
+	return m_resolve_cache.clear();
+}
+
 void
 BeesContext::set_root_fd(Fd fd)
 {
@@ -950,18 +896,21 @@ BeesContext::set_root_fd(Fd fd)
 	});
 }

-const char *
-BeesHalt::what() const noexcept
-{
-	return "bees stop requested";
-}
-
 void
 BeesContext::start()
 {
 	BEESLOGNOTICE("Starting bees main loop...");
 	BEESNOTE("starting BeesContext");

+	m_extent_locks.func([](uint64_t bytenr) {
+		return make_shared<Exclusion>();
+		(void)bytenr;
+	});
+	m_inode_locks.func([](const uint64_t fid) {
+		return make_shared<Exclusion>();
+		(void)fid;
+	});
+	m_progress_thread = make_shared<BeesThread>("progress_report");
 	m_progress_thread = make_shared<BeesThread>("progress_report");
 	m_status_thread = make_shared<BeesThread>("status_report");
 	m_progress_thread->exec([=]() {
@@ -975,6 +924,9 @@ BeesContext::start()
 	m_tmpfile_pool.generator([=]() -> shared_ptr<BeesTempFile> {
 		return make_shared<BeesTempFile>(shared_from_this());
 	});
+	m_logical_ino_pool.generator([]() {
+		return make_shared<BtrfsIoctlLogicalInoArgs>(0);
+	});
 	m_tmpfile_pool.checkin([](const shared_ptr<BeesTempFile> &btf) {
 		catch_all([&](){
 			btf->reset();
@@ -996,17 +948,37 @@ BeesContext::stop()
 	Timer stop_timer;
 	BEESLOGNOTICE("Stopping bees...");

-	BEESNOTE("aborting blocked tasks");
-	BEESLOGDEBUG("Aborting blocked tasks");
-	unique_lock<mutex> abort_lock(m_abort_mutex);
-	m_abort_requested = true;
-	m_abort_condvar.notify_all();
-	abort_lock.unlock();
-
+	// Stop TaskConsumers without hurting the Task objects that carry the Crawl state
 	BEESNOTE("pausing work queue");
 	BEESLOGDEBUG("Pausing work queue");
-	TaskMaster::set_thread_count(0);
+	TaskMaster::pause();

+	// Stop crawlers first so we get good progress persisted on disk
+	BEESNOTE("stopping crawlers and flushing crawl state");
+	BEESLOGDEBUG("Stopping crawlers and flushing crawl state");
+	if (m_roots) {
+		m_roots->stop_request();
+	} else {
+		BEESLOGDEBUG("Crawlers not running");
+	}
+
+	BEESNOTE("stopping and flushing hash table");
+	BEESLOGDEBUG("Stopping and flushing hash table");
+	if (m_hash_table) {
+		m_hash_table->stop_request();
+	} else {
+		BEESLOGDEBUG("Hash table not running");
+	}
+
+	// Wait for crawler writeback to finish
+	BEESNOTE("waiting for crawlers to stop");
+	BEESLOGDEBUG("Waiting for crawlers to stop");
+	if (m_roots) {
+		m_roots->stop_wait();
+	}
+
+	// It is now no longer possible to update progress in $BEESHOME,
+	// so we can destroy Tasks with reckless abandon.
 	BEESNOTE("setting stop_request flag");
 	BEESLOGDEBUG("Setting stop_request flag");
 	unique_lock<mutex> lock(m_stop_mutex);
@@ -1014,46 +986,13 @@ BeesContext::stop()
 	m_stop_condvar.notify_all();
 	lock.unlock();

-	// Stop crawlers first so we get good progress persisted on disk
-	BEESNOTE("stopping crawlers");
-	BEESLOGDEBUG("Stopping crawlers");
-	if (m_roots) {
-		m_roots->stop();
-		m_roots.reset();
-	} else {
-		BEESLOGDEBUG("Crawlers not running");
-	}
-
-	BEESNOTE("cancelling work queue");
-	BEESLOGDEBUG("Cancelling work queue");
-	TaskMaster::cancel();
-
-	BEESNOTE("stopping hash table");
-	BEESLOGDEBUG("Stopping hash table");
+	// Wait for hash table flush to complete
+	BEESNOTE("waiting for hash table flush to stop");
+	BEESLOGDEBUG("Waiting for hash table flush to stop");
 	if (m_hash_table) {
-		m_hash_table->stop();
-		m_hash_table.reset();
-	} else {
-		BEESLOGDEBUG("Hash table not running");
+		m_hash_table->stop_wait();
 	}

-	BEESNOTE("closing tmpfiles");
-	BEESLOGDEBUG("Closing tmpfiles");
-	m_tmpfile_pool.clear();
-
-	BEESNOTE("closing FD caches");
-	BEESLOGDEBUG("Closing FD caches");
-	if (m_fd_cache) {
-		m_fd_cache->clear();
-		BEESNOTE("destroying FD caches");
-		BEESLOGDEBUG("Destroying FD caches");
-		m_fd_cache.reset();
-	}
-
-	BEESNOTE("waiting for progress thread");
-	BEESLOGDEBUG("Waiting for progress thread");
-	m_progress_thread->join();
-
 	// Write status once with this message...
 	BEESNOTE("stopping status thread at " << stop_timer << " sec");
 	lock.lock();
@@ -1069,6 +1008,9 @@ BeesContext::stop()
 	m_status_thread->join();

 	BEESLOGNOTICE("bees stopped in " << stop_timer << " sec");
+
+	// Skip all destructors, do not pass GO, do not collect atexit() functions
+	_exit(EXIT_SUCCESS);
 }

 bool
@@ -1109,13 +1051,7 @@ shared_ptr<BeesTempFile>
 BeesContext::tmpfile()
 {
 	unique_lock<mutex> lock(m_stop_mutex);
-
-	if (m_stop_requested) {
-		throw BeesHalt();
-	}
-
 	lock.unlock();
-
 	return m_tmpfile_pool();
 }

@@ -1123,9 +1059,6 @@ shared_ptr<BeesFdCache>
 BeesContext::fd_cache()
 {
 	unique_lock<mutex> lock(m_stop_mutex);
-	if (m_stop_requested) {
-		throw BeesHalt();
-	}
 	if (!m_fd_cache) {
 		m_fd_cache = make_shared<BeesFdCache>(shared_from_this());
 	}
@@ -1136,9 +1069,6 @@ shared_ptr<BeesRoots>
 BeesContext::roots()
 {
 	unique_lock<mutex> lock(m_stop_mutex);
-	if (m_stop_requested) {
-		throw BeesHalt();
-	}
 	if (!m_roots) {
 		m_roots = make_shared<BeesRoots>(shared_from_this());
 	}
@@ -1149,9 +1079,6 @@ shared_ptr<BeesHashTable>
 BeesContext::hash_table()
 {
 	unique_lock<mutex> lock(m_stop_mutex);
-	if (m_stop_requested) {
-		throw BeesHalt();
-	}
 	if (!m_hash_table) {
 		m_hash_table = make_shared<BeesHashTable>(shared_from_this(), "beeshash.dat");
 	}
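Note on m_logical_ino_pool above: logical_ino() checks an argument object out of a pool and then calls set_logical() and set_flags() on it every time. The sketch below shows why such a reset is needed with a recycling pool: a checked-in object keeps whatever state its previous user left behind. SketchPool is a hypothetical illustration under that assumption, not the crucible Pool implementation.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical object pool: handles return their object to the pool
// when the last shared_ptr copy is destroyed.
template <class T>
class SketchPool {
	struct State {
		std::mutex mutex;
		std::vector<std::unique_ptr<T>> idle;
	};
	std::shared_ptr<State> m_state = std::make_shared<State>();
	std::function<std::unique_ptr<T>()> m_generator;
public:
	explicit SketchPool(std::function<std::unique_ptr<T>()> generator) :
		m_generator(std::move(generator))
	{
	}

	std::shared_ptr<T> operator()()
	{
		std::unique_ptr<T> obj;
		{
			// Reuse an idle object if one is available
			std::lock_guard<std::mutex> lock(m_state->mutex);
			if (!m_state->idle.empty()) {
				obj = std::move(m_state->idle.back());
				m_state->idle.pop_back();
			}
		}
		if (!obj) {
			// Otherwise build a new one outside the lock
			obj = m_generator();
		}
		// The deleter keeps the shared state alive and checks the object back in
		auto state = m_state;
		return std::shared_ptr<T>(obj.release(), [state](T *p) {
			std::unique_ptr<T> owned(p);
			std::lock_guard<std::mutex> lock(state->mutex);
			state->idle.push_back(std::move(owned));
		});
	}
};
```

Because the second checkout hands back the same object, any field not reset by the caller carries over from the previous use.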
--- a/src/bees-hash.cc
+++ b/src/bees-hash.cc
@@ -106,12 +106,6 @@ BeesHashTable::flush_dirty_extent(uint64_t extent_index)
 	BEESNOTE("flushing extent #" << extent_index << " of " << m_extents << " extents");

 	auto lock = lock_extent_by_index(extent_index);
-
-	// Not dirty, nothing to do
-	if (!m_extent_metadata.at(extent_index).m_dirty) {
-		return false;
-	}
-
 	bool wrote_extent = false;

 	catch_all([&]() {
@@ -125,9 +119,6 @@ BeesHashTable::flush_dirty_extent(uint64_t extent_index)
 		// Copy the extent because we might be stuck writing for a while
 		ByteVector extent_copy(dirty_extent, dirty_extent_end);

-		// Mark extent non-dirty while we still hold the lock
-		m_extent_metadata.at(extent_index).m_dirty = false;
-
 		// Release the lock
 		lock.unlock();

@@ -139,6 +130,10 @@ BeesHashTable::flush_dirty_extent(uint64_t extent_index)
 		// const size_t dirty_extent_size   = dirty_extent_end - dirty_extent;
 		// bees_unreadahead(m_fd, dirty_extent_offset, dirty_extent_size);

+		// Mark extent clean if write was successful
+		lock.lock();
+		m_extent_metadata.at(extent_index).m_dirty = false;
+
 		wrote_extent = true;
 	});

@@ -152,25 +147,28 @@ BeesHashTable::flush_dirty_extents(bool slowly)

 	uint64_t wrote_extents = 0;
 	for (size_t extent_index = 0; extent_index < m_extents; ++extent_index) {
+		// Skip the clean ones
+		auto lock = lock_extent_by_index(extent_index);
+		if (!m_extent_metadata.at(extent_index).m_dirty) {
+			continue;
+		}
+		lock.unlock();
+
 		if (flush_dirty_extent(extent_index)) {
 			++wrote_extents;
 			if (slowly) {
+				if (m_stop_requested) {
+					slowly = false;
+					continue;
+				}
 				BEESNOTE("flush rate limited after extent #" << extent_index << " of " << m_extents << " extents");
 				chrono::duration<double> sleep_time(m_flush_rate_limit.sleep_time(BLOCK_SIZE_HASHTAB_EXTENT));
 				unique_lock<mutex> lock(m_stop_mutex);
-				if (m_stop_requested) {
-					BEESLOGDEBUG("Stop requested in hash table flush_dirty_extents");
-					// This function is called by another thread with !slowly,
-					// so we just get out of the way here.
-					break;
-				}
 				m_stop_condvar.wait_for(lock, sleep_time);
 			}
 		}
 	}
-	if (!slowly) {
-		BEESLOGINFO("Flushed " << wrote_extents << " of " << m_extents << " extents");
-	}
+	BEESLOGINFO("Flushed " << wrote_extents << " of " << m_extents << " hash table extents");
 	return wrote_extents;
 }

@@ -204,12 +202,27 @@ BeesHashTable::writeback_loop()
 			m_dirty_condvar.wait(lock);
 		}
 	}
+
+	// The normal loop exits at the end of an iteration once stop is requested,
+	// but the request usually arrives in the middle of an iteration, leaving
+	// some extents dirty.  Run the flush loop again to get those.
+	BEESNOTE("flushing hash table, round 2");
+	BEESLOGDEBUG("Flushing hash table");
+	flush_dirty_extents(false);
+
+	// If there were any Tasks still running, they may have updated
+	// some hash table pages during the second flush.  These updates
+	// will be lost.  The Tasks will be repeated on the next run because
+	// they were not completed prior to the stop request, and the
+	// Crawl progress was already flushed out before the Hash table
+	// started writing, so nothing is really lost here.
+
 	catch_all([&]() {
 		// trigger writeback on our way out
 #if 0
 		// seems to trigger huge latency spikes
-		BEESTOOLONG("unreadahead hash table size " << pretty(m_size));
-		bees_unreadahead(m_fd, 0, m_size);
+		BEESTOOLONG("unreadahead hash table size " <<
+		pretty(m_size)); bees_unreadahead(m_fd, 0, m_size);
 #endif
 	});
 	BEESLOGDEBUG("Exited hash table writeback_loop");
@@ -794,7 +807,7 @@ BeesHashTable::~BeesHashTable()
 }

 void
-BeesHashTable::stop()
+BeesHashTable::stop_request()
 {
 	BEESNOTE("stopping BeesHashTable threads");
 	BEESLOGDEBUG("Stopping BeesHashTable threads");
@@ -808,7 +821,11 @@ BeesHashTable::stop()
 	unique_lock<mutex> dirty_lock(m_dirty_mutex);
 	m_dirty_condvar.notify_all();
 	dirty_lock.unlock();
+}

+void
+BeesHashTable::stop_wait()
+{
 	BEESNOTE("waiting for hash_prefetch thread");
 	BEESLOGDEBUG("Waiting for hash_prefetch thread");
 	m_prefetch_thread.join();
@@ -817,11 +834,5 @@ BeesHashTable::stop()
 	BEESLOGDEBUG("Waiting for hash_writeback thread");
 	m_writeback_thread.join();

-	if (m_cell_ptr && m_size) {
-		BEESLOGDEBUG("Flushing hash table");
-		BEESNOTE("flushing hash table");
-		flush_dirty_extents(false);
-	}
-
 	BEESLOGDEBUG("BeesHashTable stopped");
 }
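Note on flush_dirty_extent() above: the dirty flag is now cleared only after the write has completed, so a failed write leaves the extent dirty for a later retry. A minimal sketch of that ordering follows. SketchExtent and flush_extent are hypothetical names, and the writer callback stands in for the real write to the hash table file.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory extent with a dirty flag guarded by a mutex.
struct SketchExtent {
	std::mutex mutex;
	std::vector<char> data;
	bool dirty = false;
};

// Returns true if the extent was written and marked clean.
template <class Writer>
bool flush_extent(SketchExtent &e, Writer write)
{
	std::unique_lock<std::mutex> lock(e.mutex);
	if (!e.dirty) {
		return false;
	}
	// Copy under the lock because the write may take a while
	const std::vector<char> copy(e.data);
	lock.unlock();
	try {
		write(copy);
	} catch (...) {
		// Write failed: leave the extent dirty so it is retried
		return false;
	}
	// Mark clean only after the write succeeded
	lock.lock();
	e.dirty = false;
	return true;
}
```

One caveat of this sketch: an update that lands between the copy and the re-lock is marked clean without having been written, and nothing here guards against that.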
--- a/src/bees-roots.cc
+++ b/src/bees-roots.cc
--- a/src/bees-trace.cc
+++ b/src/bees-trace.cc
@@ -111,9 +111,7 @@ void
 BeesNote::set_name(const string &name)
 {
 	tl_name = name;
-	catch_all([&]() {
-		DIE_IF_MINUS_ERRNO(pthread_setname_np(pthread_self(), name.c_str()));
-	});
+	pthread_setname(name);
 }

 string
@@ -134,19 +132,12 @@ BeesNote::get_name()
 	}

 	// OK try the pthread name next.
-	char buf[24];
-	memset(buf, '\0', sizeof(buf));
-	int err = pthread_getname_np(pthread_self(), buf, sizeof(buf));
-	if (err) {
-		return string("pthread_getname_np: ") + strerror(err);
-	}
-	buf[sizeof(buf) - 1] = '\0';

 	// thread_getname_np returns process name
 	// ...by default?  ...for the main thread?
 	// ...except during exception handling?
 	// ...randomly?
-	return buf;
+	return pthread_getname();
 }

 BeesNote::ThreadStatusMap
--- a/src/bees-types.cc
+++ b/src/bees-types.cc
@@ -238,42 +238,6 @@ BeesFileRange::overlaps(const BeesFileRange &that) const
 	return false;
 }

-bool
-BeesFileRange::coalesce(const BeesFileRange &that)
-{
-	// Let's define coalesce-with-null as identity,
-	// and coalesce-null-with-null as coalesced
-	if (!*this) {
-		operator=(that);
-		return true;
-	}
-	if (!that) {
-		return true;
-	}
-
-	// Can't coalesce different files
-	if (!is_same_file(that)) return false;
-
-	pair<uint64_t, uint64_t> a(m_begin, m_end);
-	pair<uint64_t, uint64_t> b(that.m_begin, that.m_end);
-
-	// range a starts lower than or equal b
-	if (b.first < a.first) {
-		swap(a, b);
-	}
-
-	// if b starts within a, they overlap
-	// (and the intersecting region is b.first..min(a.second, b.second))
-	// (and the union region is a.first..max(a.second, b.second))
-	if (b.first >= a.first && b.first < a.second) {
-		m_begin = a.first;
-		m_end = max(a.second, b.second);
-		return true;
-	}
-
-	return false;
-}
-
 BeesFileRange::operator BeesBlockData() const
 {
 	BEESTRACE("operator BeesBlockData " << *this);
--- a/src/bees.cc
+++ b/src/bees.cc
@@ -215,23 +215,12 @@ BeesTooLong::operator=(const func_type &f)
 }

 void
-bees_sync(int fd)
-{
-	Timer sync_timer;
-	BEESNOTE("syncing " << name_fd(fd));
-	BEESTOOLONG("syncing " << name_fd(fd));
-	DIE_IF_NON_ZERO(fsync(fd));
-	BEESCOUNT(sync_count);
-	BEESCOUNTADD(sync_ms, sync_timer.age() * 1000);
-}
-
-void
-bees_readahead(int const fd, off_t offset, size_t size)
+bees_readahead(int const fd, const off_t offset, const size_t size)
 {
 	Timer readahead_timer;
 	BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 	BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
-#if 1
+#if 0
 	// In the kernel, readahead() is identical to posix_fadvise(..., POSIX_FADV_WILLNEED)
 	DIE_IF_NON_ZERO(readahead(fd, offset, size));
 #else
@@ -241,18 +230,20 @@ bees_readahead(int const fd, off_t offset, size_t size)
 	// and might discard the readahead request entirely,
 	// so it's maybe, *maybe*, worth doing both.
 	BEESNOTE("emulating readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
-	while (size) {
+	auto working_size = size;
+	auto working_offset = offset;
+	while (working_size) {
 		// don't care about multithreaded writes to this buffer--it is garbage anyway
 		static uint8_t dummy[BEES_READAHEAD_SIZE];
-		size_t this_read_size = min(size, sizeof(dummy));
+		const size_t this_read_size = min(working_size, sizeof(dummy));
 		// Ignore errors and short reads.  It turns out our size
 		// parameter isn't all that accurate, so we can't use
 		// the pread_or_die template.
-		(void)!pread(fd, dummy, this_read_size, offset);
+		(void)!pread(fd, dummy, this_read_size, working_offset);
 		BEESCOUNT(readahead_count);
 		BEESCOUNTADD(readahead_bytes, this_read_size);
-		offset += this_read_size;
-		size -= this_read_size;
+		working_offset += this_read_size;
+		working_size -= this_read_size;
 	}
 #endif
 	BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
@@ -481,7 +472,6 @@ BeesTempFile::make_copy(const BeesFileRange &src)
 	auto src_p = src.begin();
 	auto dst_p = begin;

-	bool did_block_write = false;
 	while (dst_p < end) {
 		auto len = min(BLOCK_SIZE_CLONE, end - dst_p);
 		BeesBlockData bbd(src.fd(), src_p, len);
@@ -492,7 +482,6 @@ BeesTempFile::make_copy(const BeesFileRange &src)
 			BEESNOTE("copying " << src << " to " << rv << "\n"
 				"\tpwrite " << bbd << " to " << name_fd(m_fd) << " offset " << to_hex(dst_p) << " len " << len);
 			pwrite_or_die(m_fd, bbd.data().data(), len, dst_p);
-			did_block_write = true;
 			BEESCOUNT(tmp_block);
 			BEESCOUNTADD(tmp_bytes, len);
 		}
@@ -501,16 +490,6 @@ BeesTempFile::make_copy(const BeesFileRange &src)
 	}
 	BEESCOUNTADD(tmp_copy_ms, copy_timer.age() * 1000);

-	if (did_block_write) {
-#if 0
-		// There were a lot of kernel bugs leading to lockups.
-		// Most of them are fixed now.
-		// Unnecessary sync makes us slow, but maybe it has some robustness utility.
-		// TODO:  make this configurable.
-		bees_sync(m_fd);
-#endif
-	}
-
 	BEESCOUNT(tmp_copy);
 	return rv;
 }
@@ -624,7 +603,7 @@ bees_main(int argc, char *argv[])
 	unsigned thread_min = 0;
 	double load_target = 0;
 	bool workaround_btrfs_send = false;
-	BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_ZERO;
+	BeesRoots::ScanMode root_scan_mode = BeesRoots::SCAN_MODE_INDEPENDENT;

 	// Configure getopt_long
 	static const struct option long_options[] = {
@@ -795,8 +774,8 @@ main(int argc, char *argv[])
 		return EXIT_FAILURE;
 	}

-	int rv = 1;
-	catch_and_explain([&]() {
+	int rv = EXIT_FAILURE;
+	catch_all([&]() {
 		rv = bees_main(argc, argv);
 	});
 	BEESLOGNOTICE("Exiting with status " << rv << " " << (rv ? "(failure)" : "(success)"));
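
Note on the bees_readahead hunk above: the parameters are now const, so the emulation loop advances local working copies and the logging macros keep seeing the caller's original offset and size. A minimal standalone sketch of that loop follows. The names emulate_readahead and SKETCH_READAHEAD_SIZE are hypothetical, the BEES* macros are left out, and the return value is added only so the behaviour can be checked.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for BEES_READAHEAD_SIZE (1 MiB in bees.h).
constexpr size_t SKETCH_READAHEAD_SIZE = 1024 * 1024;

// Read [offset, offset + size) in chunks and throw the data away.
// Returns the number of bytes requested from pread(), which is not
// necessarily the number of bytes actually read.
size_t emulate_readahead(const int fd, const off_t offset, const size_t size)
{
	// Scratch buffer; its contents are never used, so concurrent
	// writes from several threads are tolerated as in the original.
	static uint8_t dummy[SKETCH_READAHEAD_SIZE];

	// The parameters are const, so advance working copies instead.
	auto working_offset = offset;
	auto working_size = size;
	size_t requested = 0;

	while (working_size) {
		const size_t this_read_size = std::min(working_size, sizeof(dummy));
		// Errors and short reads are deliberately ignored.
		(void)!pread(fd, dummy, this_read_size, working_offset);
		requested += this_read_size;
		working_offset += this_read_size;
		working_size -= this_read_size;
	}
	return requested;
}
```

Because errors are ignored, the sketch still walks the whole range when handed an invalid descriptor; it just reads nothing.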
--- a/src/bees.h
+++ b/src/bees.h
@@ -1,6 +1,7 @@
 #ifndef BEES_H
 #define BEES_H

+#include "crucible/btrfs-tree.h"
 #include "crucible/cache.h"
 #include "crucible/chatter.h"
 #include "crucible/error.h"
@@ -8,6 +9,7 @@
 #include "crucible/fd.h"
 #include "crucible/fs.h"
 #include "crucible/lockset.h"
+#include "crucible/multilock.h"
 #include "crucible/pool.h"
 #include "crucible/progress.h"
 #include "crucible/time.h"
@@ -59,8 +61,9 @@ const off_t BLOCK_SIZE_HASHTAB_BUCKET = BLOCK_SIZE_MMAP;
 // Extent size for hash table (since the nocow file attribute does not seem to be working today)
 const off_t BLOCK_SIZE_HASHTAB_EXTENT = BLOCK_SIZE_MAX_COMPRESSED_EXTENT;

-// Bytes per second we want to flush (8GB every two hours)
-const double BEES_FLUSH_RATE = 8.0 * 1024 * 1024 * 1024 / 7200.0;
+// Bytes per second we want to flush from hash table
+// Optimistic sustained write rate for SD cards
+const double BEES_FLUSH_RATE = 128 * 1024;

 // Interval between writing crawl state to disk
 const int BEES_WRITEBACK_INTERVAL = 900;
@@ -98,20 +101,8 @@ const size_t BEES_MAX_EXTENT_REF_COUNT = (16 * 1024 * 1024 / 24) - 1;
 // How long between hash table histograms
 const double BEES_HASH_TABLE_ANALYZE_INTERVAL = BEES_STATS_INTERVAL;

-// Stop growing the work queue after we have this many tasks queued
-const size_t BEES_MAX_QUEUE_SIZE = 128;
-
-// Insert this many items before switching to a new subvol
-const size_t BEES_MAX_CRAWL_BATCH = 128;
-
-// Wait this many transids between crawls
-const size_t BEES_TRANSID_FACTOR = 10;
-
-// Wait this long for a balance to stop
-const double BEES_BALANCE_POLL_INTERVAL = 60.0;
-
-// Workaround for tree mod log bugs
-const bool BEES_SERIALIZE_BALANCE = false;
+// Wait at least this long for a new transid
+const double BEES_TRANSID_POLL_INTERVAL = 30.0;

 // Workaround for silly dedupe / ineffective readahead behavior
 const size_t BEES_READAHEAD_SIZE = 1024 * 1024;
@@ -282,35 +273,31 @@ public:
 	bool is_same_file(const BeesFileRange &that) const;
 	bool overlaps(const BeesFileRange &that) const;

-	// If file ranges overlap, extends this to include that.
-	// Coalesce with empty bfr = non-empty bfr
-	bool coalesce(const BeesFileRange &that);
-
-	// Remove that from this, creating 0, 1, or 2 new objects
-	pair<BeesFileRange, BeesFileRange> subtract(const BeesFileRange &that) const;
-
 	off_t begin() const { return m_begin; }
 	off_t end() const { return m_end; }
 	off_t size() const;

-	// Lazy accessors
+	/// @{ Lazy accessors
 	off_t file_size() const;
 	BeesFileId fid() const;
+	/// @}

-	// Get the fd if there is one
+	/// Get the fd if there is one
 	Fd fd() const;

-	// Get the fd, opening it if necessary
+	/// Get the fd, opening it if necessary
 	Fd fd(const shared_ptr<BeesContext> &ctx);

+	/// Copy the BeesFileId but not the Fd
 	BeesFileRange copy_closed() const;

-	// Is it defined?
+	/// Is it defined?
 	operator bool() const { return !!m_fd || m_fid; }

-	// Make range larger
+	/// @{ Make range larger
 	off_t grow_end(off_t delta);
 	off_t grow_begin(off_t delta);
+	/// @}

 friend ostream & operator<<(ostream &os, const BeesFileRange &bfr);
 };
@@ -422,12 +409,14 @@ public:
 	BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t size = BLOCK_SIZE_HASHTAB_EXTENT);
 	~BeesHashTable();

-	void stop();
+	void stop_request();
+	void stop_wait();

 	vector<Cell>	find_cell(HashType hash);
 	bool		push_random_hash_addr(HashType hash, AddrType addr);
 	void		erase_hash_addr(HashType hash, AddrType addr);
 	bool		push_front_hash_addr(HashType hash, AddrType addr);
+	bool		flush_dirty_extent(uint64_t extent_index);

 private:
 	string		m_filename;
@@ -487,7 +476,6 @@ private:
 	void fetch_missing_extent_by_index(uint64_t extent_index);
 	void set_extent_dirty_locked(uint64_t extent_index);
 	size_t flush_dirty_extents(bool slowly);
-	bool flush_dirty_extent(uint64_t extent_index);

 	size_t			hash_to_extent_index(HashType ht);
 	unique_lock<mutex>	lock_extent_by_hash(HashType ht);
@@ -516,43 +504,48 @@ class BeesCrawl {
 	shared_ptr<BeesContext>			m_ctx;

 	mutex					m_mutex;
-	set<BeesFileRange>			m_extents;
+	BtrfsTreeItem				m_next_extent_data;
 	bool					m_deferred = false;
 	bool					m_finished = false;

 	mutex					m_state_mutex;
 	ProgressTracker<BeesCrawlState>		m_state;

+	BtrfsTreeObjectFetcher			m_btof;
+
 	bool fetch_extents();
 	void fetch_extents_harder();
 	bool next_transid();
+	BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;

 public:
 	BeesCrawl(shared_ptr<BeesContext> ctx, BeesCrawlState initial_state);
 	BeesFileRange peek_front();
 	BeesFileRange pop_front();
-	ProgressTracker<BeesCrawlState>::ProgressHolder hold_state(const BeesFileRange &bfr);
+	ProgressTracker<BeesCrawlState>::ProgressHolder hold_state(const BeesCrawlState &bcs);
 	BeesCrawlState get_state_begin();
-	BeesCrawlState get_state_end();
+	BeesCrawlState get_state_end() const;
 	void set_state(const BeesCrawlState &bcs);
 	void deferred(bool def_setting);
 };

+class BeesScanMode;
+
 class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	shared_ptr<BeesContext>			m_ctx;

 	BeesStringFile				m_crawl_state_file;
 	map<uint64_t, shared_ptr<BeesCrawl>>	m_root_crawl_map;
 	mutex					m_mutex;
-	bool					m_crawl_dirty = false;
+	uint64_t				m_crawl_dirty = 0;
+	uint64_t				m_crawl_clean = 0;
 	Timer					m_crawl_timer;
 	BeesThread				m_crawl_thread;
 	BeesThread				m_writeback_thread;
 	RateEstimator				m_transid_re;
-	size_t					m_transid_factor = BEES_TRANSID_FACTOR;
-	Task					m_crawl_task;
 	bool					m_workaround_btrfs_send = false;
-	LRUCache<bool, uint64_t>		m_root_ro_cache;
+
+	shared_ptr<BeesScanMode>		m_scanner;

 	mutex					m_tmpfiles_mutex;
 	map<BeesFileId, Fd>			m_tmpfiles;
@@ -565,7 +558,6 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	void insert_root(const BeesCrawlState &bcs);
 	Fd open_root_nocache(uint64_t root);
 	Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
-	bool is_root_ro_nocache(uint64_t root);
 	uint64_t transid_min();
 	uint64_t transid_max();
 	uint64_t transid_max_nocache();
@@ -581,41 +573,38 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	uint64_t next_root(uint64_t root = 0);
 	void current_state_set(const BeesCrawlState &bcs);
 	RateEstimator& transid_re();
-	size_t crawl_batch(shared_ptr<BeesCrawl> crawl);
+	bool crawl_batch(shared_ptr<BeesCrawl> crawl);
 	void clear_caches();
-	void insert_tmpfile(Fd fd);
-	void erase_tmpfile(Fd fd);

-friend class BeesFdCache;
 friend class BeesCrawl;
-friend class BeesTempFile;
+friend class BeesFdCache;
+friend class BeesScanMode;

 public:
 	BeesRoots(shared_ptr<BeesContext> ctx);
 	void start();
-	void stop();
+	void stop_request();
+	void stop_wait();
+
+	void insert_tmpfile(Fd fd);
+	void erase_tmpfile(Fd fd);

 	Fd open_root(uint64_t root);
 	Fd open_root_ino(uint64_t root, uint64_t ino);
 	Fd open_root_ino(const BeesFileId &bfi) { return open_root_ino(bfi.root(), bfi.ino()); }
 	bool is_root_ro(uint64_t root);

-	// TODO:  think of better names for these.
-	// or TODO:  do extent-tree scans instead
+	// TODO:  do extent-tree scans instead
 	enum ScanMode {
-		SCAN_MODE_ZERO,
-		SCAN_MODE_ONE,
-		SCAN_MODE_TWO,
+		SCAN_MODE_LOCKSTEP,
+		SCAN_MODE_INDEPENDENT,
+		SCAN_MODE_SEQUENTIAL,
+		SCAN_MODE_RECENT,
 		SCAN_MODE_COUNT, // must be last
 	};

 	void set_scan_mode(ScanMode new_mode);
 	void set_workaround_btrfs_send(bool do_avoid);
-
-private:
-	ScanMode m_scan_mode = SCAN_MODE_ZERO;
-	static string scan_mode_ntoa(ScanMode new_mode);
-
 };

 struct BeesHash {
@@ -718,19 +707,14 @@ struct BeesResolveAddrResult {
 	bool is_toxic() const { return m_is_toxic; }
 };

-struct BeesHalt : exception {
-	const char *what() const noexcept override;
-};
-
 class BeesContext : public enable_shared_from_this<BeesContext> {
-	shared_ptr<BeesContext>				m_parent_ctx;
-
 	Fd						m_home_fd;

 	shared_ptr<BeesFdCache>				m_fd_cache;
 	shared_ptr<BeesHashTable>			m_hash_table;
 	shared_ptr<BeesRoots>				m_roots;
 	Pool<BeesTempFile>				m_tmpfile_pool;
+	Pool<BtrfsIoctlLogicalInoArgs>			m_logical_ino_pool;

 	LRUCache<BeesResolveAddrResult, BeesAddress>	m_resolve_cache;

@@ -742,30 +726,25 @@ class BeesContext : public enable_shared_from_this<BeesContext> {

 	Timer						m_total_timer;

-	LockSet<uint64_t>				m_extent_lock_set;
+	NamedPtr<Exclusion, uint64_t>			m_extent_locks;
+	NamedPtr<Exclusion, uint64_t>			m_inode_locks;

 	mutable mutex					m_stop_mutex;
 	condition_variable				m_stop_condvar;
 	bool						m_stop_requested = false;
 	bool						m_stop_status = false;

-	mutable mutex					m_abort_mutex;
-	condition_variable				m_abort_condvar;
-	bool						m_abort_requested = false;
-
 	shared_ptr<BeesThread>				m_progress_thread;
 	shared_ptr<BeesThread>				m_status_thread;

 	void set_root_fd(Fd fd);

 	BeesResolveAddrResult resolve_addr_uncached(BeesAddress addr);
-	void wait_for_balance();

 	BeesFileRange scan_one_extent(const BeesFileRange &bfr, const Extent &e);
 	void rewrite_file_range(const BeesFileRange &bfr);

 public:
-	BeesContext() = default;

 	void set_root_path(string path);

@@ -773,7 +752,9 @@ public:
 	Fd home_fd();
 	string root_path() const { return m_root_path; }

-	BeesFileRange scan_forward(const BeesFileRange &bfr);
+	bool scan_forward(const BeesFileRange &bfr);
+
+	shared_ptr<BtrfsIoctlLogicalInoArgs> logical_ino(uint64_t bytenr, bool all_refs);

 	bool is_root_ro(uint64_t root);
 	BeesRangePair dup_extent(const BeesFileRange &src, const shared_ptr<BeesTempFile> &tmpfile);
@@ -783,8 +764,11 @@ public:
 	void blacklist_erase(const BeesFileId &fid);
 	bool is_blacklisted(const BeesFileId &fid) const;

+	shared_ptr<Exclusion> get_inode_mutex(uint64_t inode);
+
 	BeesResolveAddrResult resolve_addr(BeesAddress addr);
 	void invalidate_addr(BeesAddress addr);
+	void resolve_cache_clear();

 	void dump_status();
 	void show_progress();
@@ -799,7 +783,6 @@ public:
 	shared_ptr<BeesTempFile> tmpfile();

 	const Timer &total_timer() const { return m_total_timer; }
-	LockSet<uint64_t> &extent_lock_set() { return m_extent_lock_set; }
 };

 class BeesResolver {
@@ -884,7 +867,6 @@ extern const char *BEES_USAGE;
 extern const char *BEES_VERSION;
 extern thread_local default_random_engine bees_generator;
 string pretty(double d);
-void bees_sync(int fd);
 void bees_readahead(int fd, off_t offset, size_t size);
 void bees_unreadahead(int fd, off_t offset, size_t size);
 string format_time(time_t t);
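
Note on the BEES_FLUSH_RATE change above: the two constants can be compared with a little arithmetic. The sketch below only restates the removed and added expressions. OLD_FLUSH_RATE, NEW_FLUSH_RATE and flush_seconds are hypothetical names, and the "whole table is dirty" scenario is an assumption for illustration, not something the patch states.

```cpp
// Removed line: 8 GiB every two hours, in bytes per second (about 1.14 MiB/s).
constexpr double OLD_FLUSH_RATE = 8.0 * 1024 * 1024 * 1024 / 7200.0;

// Added line: 128 KiB per second.
constexpr double NEW_FLUSH_RATE = 128 * 1024;

// Seconds needed to write table_bytes at a steady rate of bytes per second.
constexpr double flush_seconds(double table_bytes, double rate)
{
	return table_bytes / rate;
}

// The new rate is roughly nine times lower than the old one.
static_assert(OLD_FLUSH_RATE / NEW_FLUSH_RATE > 9.0, "ratio is above 9");
static_assert(OLD_FLUSH_RATE / NEW_FLUSH_RATE < 9.2, "ratio is below 9.2");
```

Under that assumption, flushing a fully dirty 8 GiB table at the new rate would take 65536 seconds, a little over 18 hours, against two hours at the old rate.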
--- a/src/fiemap.cc
+++ b/src/fiemap.cc
@@ -1,55 +0,0 @@
-#include "crucible/fd.h"
-#include "crucible/fs.h"
-#include "crucible/error.h"
-#include "crucible/string.h"
-
-#include <iostream>
-
-#include <fcntl.h>
-#include <sys/stat.h>
-#include <unistd.h>
-
-using namespace crucible;
-using namespace std;
-
-int
-main(int argc, char **argv)
-{
-	catch_all([&]() {
-		THROW_CHECK1(invalid_argument, argc, argc > 1);
-		string filename = argv[1];
-
-	
-		cout << "File: " << filename << endl;
-		Fd fd = open_or_die(filename, O_RDONLY);
-		uint64_t start = 0;
-		uint64_t length = Fiemap::s_fiemap_max_offset;
-		if (argc > 2) { start = stoull(argv[2], nullptr, 0); }
-		if (argc > 3) { length = stoull(argv[3], nullptr, 0); }
-		length = min(length, Fiemap::s_fiemap_max_offset - start);
-		Fiemap fm(start, length);
-		fm.m_flags &= ~(FIEMAP_FLAG_SYNC);
-		fm.m_max_count = 100;
-		if (argc > 4) { fm.m_flags = stoull(argv[4], nullptr, 0); }
-		uint64_t stop_at = start + length;
-		uint64_t last_byte = start;
-		do {
-			fm.do_ioctl(fd);
-			// cerr << fm;
-			uint64_t last_logical = Fiemap::s_fiemap_max_offset;
-			for (auto &extent : fm.m_extents) {
-				if (extent.fe_logical > last_byte) {
-					cout << "Log " << to_hex(last_byte) << ".." << to_hex(extent.fe_logical) << " Hole" << endl;
-				}
-				cout << "Log " << to_hex(extent.fe_logical) << ".." << to_hex(extent.fe_logical + extent.fe_length)
-					<< " Phy " << to_hex(extent.fe_physical) << ".." << to_hex(extent.fe_physical + extent.fe_length)
-					<< " Flags " << fiemap_extent_flags_ntoa(extent.fe_flags) << endl;
-				last_logical = extent.fe_logical + extent.fe_length;
-				last_byte = last_logical;
-			}
-			fm.m_start = last_logical;
-		} while (fm.m_start < stop_at);
-	});
-	exit(EXIT_SUCCESS);
-}
-
--- a/src/fiewalk.cc
+++ b/src/fiewalk.cc
@@ -1,40 +0,0 @@
-#include "crucible/extentwalker.h"
-#include "crucible/error.h"
-#include "crucible/string.h"
-
-#include <iostream>
-
-#include <fcntl.h>
-#include <unistd.h>
-
-using namespace crucible;
-using namespace std;
-
-int
-main(int argc, char **argv)
-{
-	catch_all([&]() {
-		THROW_CHECK1(invalid_argument, argc, argc > 1);
-		string filename = argv[1];
-
-		cout << "File: " << filename << endl;
-		Fd fd = open_or_die(filename, O_RDONLY);
-		BtrfsExtentWalker ew(fd);
-		off_t pos = 0;
-		if (argc > 2) { pos = stoull(argv[2], nullptr, 0); }
-		ew.seek(pos);
-		do {
-			// cout << "\n\n>>>" << ew.current() << "<<<\n\n" << endl;
-			cout << ew.current() << endl;
-		} while (ew.next());
-#if 0
-		cout << "\n\n\nAnd now, backwards...\n\n\n" << endl;
-		do {
-			cout << "\n\n>>>" << ew.current() << "<<<\n\n" << endl;
-		} while (ew.prev());
-		cout << "\n\n\nDone!\n\n\n" << endl;
-#endif
-	});
-	exit(EXIT_SUCCESS);
-}
-
--- a/test/Makefile
+++ b/test/Makefile
@@ -7,6 +7,7 @@ PROGRAMS = \
 	path \
 	process \
 	progress \
+	seeker \
 	task \

 all: test
@@ -20,17 +21,10 @@ include ../makeflags
 LIBS = -lcrucible -lpthread
 BEES_LDFLAGS = -L../lib $(LDFLAGS)

-.depends:
-	mkdir -p $@
-
-.depends/%.dep: %.cc tests.h Makefile | .depends
+%.dep: %.cc tests.h Makefile
 	$(CXX) $(BEES_CXXFLAGS) -M -MF $@ -MT $(<:.cc=.o) $<

-depends.mk: $(PROGRAMS:%=.depends/%.dep)
-	cat $^ > $@.new
-	mv -f $@.new $@
-
-include depends.mk
+include $(PROGRAMS:%=%.dep)

 $(PROGRAMS:%=%.o): %.o: %.cc ../makeflags Makefile
 	$(CXX) $(BEES_CXXFLAGS) -o $@ -c $<
--- a/test/limits.cc
+++ b/test/limits.cc
@@ -3,6 +3,7 @@
 #include "crucible/limits.h"

 #include <cassert>
+#include <cstdint>

 using namespace crucible;

--- a/test/progress.cc
+++ b/test/progress.cc
@@ -12,23 +12,49 @@ using namespace std;
 void
 test_progress()
 {
+	// On create, begin == end == constructor argument
 	ProgressTracker<uint64_t> pt(123);
-	auto hold = pt.hold(234);
-	auto hold2 = pt.hold(345);
 	assert(pt.begin() == 123);
-	assert(pt.end() == 345);
-	auto hold3 = pt.hold(456);
-	assert(pt.begin() == 123);
-	assert(pt.end() == 456);
-	hold2.reset();
-	assert(pt.begin() == 123);
-	assert(pt.end() == 456);
-	hold.reset();
+	assert(pt.end() == 123);
+
+	// Holding a position past the end increases the end (and moves begin to match)
+	auto hold345 = pt.hold(345);
 	assert(pt.begin() == 345);
+	assert(pt.end() == 345);
+
+	// Holding a position before begin reduces begin, without changing end
+	auto hold234 = pt.hold(234);
+	assert(pt.begin() == 234);
+	assert(pt.end() == 345);
+
+	// Holding a position past the end increases the end, without affecting begin
+	auto hold456 = pt.hold(456);
+	assert(pt.begin() == 234);
 	assert(pt.end() == 456);
-	hold3.reset();
+
+	// Releasing a position in the middle affects neither begin nor end
+	hold345.reset();
+	assert(pt.begin() == 234);
+	assert(pt.end() == 456);
+
+	// Hold another position in the middle to test begin moving forward
+	auto hold400 = pt.hold(400);
+
+	// Releasing a position at the beginning moves begin forward
+	hold234.reset();
+	assert(pt.begin() == 400);
+	assert(pt.end() == 456);
+
+	// Releasing a position at the end doesn't move end backward
+	hold456.reset();
+	assert(pt.begin() == 400);
+	assert(pt.end() == 456);
+
+	// Releasing a position in the middle doesn't move end backward but does move begin forward
+	hold400.reset();
 	assert(pt.begin() == 456);
 	assert(pt.end() == 456);
+
 }

 int
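
Note on the rewritten progress test above: its comments describe begin() as the lowest position still held and end() as the highest position ever held, with begin() falling back to end() when nothing is held. The sketch below is a small model that satisfies the same assertions. It is not the crucible ProgressTracker implementation; ProgressModel, Handle and release are hypothetical names.

```cpp
#include <cstdint>
#include <set>

// Toy model of the behaviour asserted in test_progress().
class ProgressModel {
	std::multiset<uint64_t> m_held;  // positions currently held
	uint64_t m_end;                  // highest position ever held
public:
	using Handle = std::multiset<uint64_t>::iterator;

	explicit ProgressModel(uint64_t start) : m_end(start) {}

	// Hold a position; end only ever moves forward.
	Handle hold(uint64_t pos)
	{
		if (pos > m_end) m_end = pos;
		return m_held.insert(pos);
	}

	// Release a hold; begin moves forward if this was the lowest one.
	void release(Handle h) { m_held.erase(h); }

	// Lowest held position, or end when nothing is held.
	uint64_t begin() const { return m_held.empty() ? m_end : *m_held.begin(); }

	uint64_t end() const { return m_end; }
};
```

The real class hands out holder objects that release on reset() or destruction; the explicit release() here keeps the model short.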
--- a/test/seeker.cc
+++ b/test/seeker.cc
@@ -0,0 +1,101 @@
+#include "tests.h"
+
+#include "crucible/seeker.h"
+
+#include <set>
+#include <vector>
+
+#include <unistd.h>
+
+using namespace crucible;
+
+static
+set<uint64_t>
+seeker_finder(const vector<uint64_t> &vec, uint64_t lower, uint64_t upper)
+{
+	set<uint64_t> s(vec.begin(), vec.end());
+	auto lb = s.lower_bound(lower);
+	auto ub = lb;
+	if (ub != s.end()) ++ub;
+	if (ub != s.end()) ++ub;
+	for (; ub != s.end(); ++ub) {
+		if (*ub > upper) break;
+	}
+	return set<uint64_t>(lb, ub);
+}
+
+static bool test_fails = false;
+
+static
+void
+seeker_test(const vector<uint64_t> &vec, uint64_t const target)
+{
+	cerr << "Find " << target << " in {";
+	for (auto i : vec) {
+		cerr << " " << i;
+	}
+	cerr << " } = ";
+	size_t loops = 0;
+	bool excepted = catch_all([&]() {
+		auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
+			++loops;
+			return seeker_finder(vec, lower, upper);
+		});
+		cerr << found;
+		uint64_t my_found = 0;
+		for (auto i : vec) {
+			if (i <= target) {
+				my_found = i;
+			}
+		}
+		if (found == my_found) {
+			cerr << " (correct)";
+		} else {
+			cerr << " (INCORRECT - right answer is " << my_found << ")";
+			test_fails = true;
+		}
+	});
+	cerr << " (" << loops << " loops)" << endl;
+	if (excepted) {
+		test_fails = true;
+	}
+}
+
+static
+void
+test_seeker()
+{
+	seeker_test(vector<uint64_t> { 0, 1, 2, 3, 4, 5 }, 3);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 3, 4, 5 }, 5);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 3, 4, 5 }, 0);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 3, 4, 5 }, 1);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 3, 4, 5 }, 4);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 3, 4, 5 }, 2);
+
+	seeker_test(vector<uint64_t> { 11, 22, 33, 44, 55 }, 2);
+	seeker_test(vector<uint64_t> { 11, 22, 33, 44, 55 }, 25);
+	seeker_test(vector<uint64_t> { 11, 22, 33, 44, 55 }, 52);
+	seeker_test(vector<uint64_t> { 11, 22, 33, 44, 55 }, 99);
+	seeker_test(vector<uint64_t> { 11, 22, 33, 44, 55, 56 }, 99);
+	seeker_test(vector<uint64_t> { 11, 22, 33, 44, 55 }, 1);
+	seeker_test(vector<uint64_t> { 11, 22, 33, 44, 55 }, 55);
+	seeker_test(vector<uint64_t> { 11 }, 55);
+	seeker_test(vector<uint64_t> { 11 }, 10);
+	seeker_test(vector<uint64_t> { 55 }, 55);
+	seeker_test(vector<uint64_t> { }, 55);
+	seeker_test(vector<uint64_t> { 55 }, numeric_limits<uint64_t>::max());
+	seeker_test(vector<uint64_t> { 55 }, numeric_limits<uint64_t>::max() - 1);
+	seeker_test(vector<uint64_t> { }, numeric_limits<uint64_t>::max());
+	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max());
+	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max() - 1);
+	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() - 1 }, numeric_limits<uint64_t>::max());
+}
+
+
+int main(int, const char **)
+{
+
+	RUN_A_TEST(test_seeker());
+
+	return test_fails ? EXIT_FAILURE : EXIT_SUCCESS;
+}
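
Note on the new seeker test above: seeker_test() checks seek_backward against a brute-force answer, the largest element not exceeding the target, defaulting to 0 when there is none. The sketch below computes the same expected value with a set lookup. expected_seek_backward is a hypothetical helper, not part of crucible, and it says nothing about how seek_backward itself searches.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Largest element of vec that is <= target, or 0 if there is none
// (0 matches the default value of my_found in seeker_test()).
uint64_t expected_seek_backward(const std::vector<uint64_t> &vec, uint64_t target)
{
	const std::set<uint64_t> s(vec.begin(), vec.end());
	// First element strictly greater than target.
	auto it = s.upper_bound(target);
	// Nothing at or below target.
	if (it == s.begin()) return 0;
	// Step back to the last element <= target.
	return *--it;
}
```

The loop in seeker_test() keeps the last matching element and so relies on its input vectors being sorted; the set-based version does not.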
--- a/test/task.cc
+++ b/test/task.cc
@@ -90,47 +90,51 @@ test_barrier(size_t count)

 	mutex mtx;
 	condition_variable cv;
+	bool done_flag = false;

 	unique_lock<mutex> lock(mtx);

-	auto b = make_shared<Barrier>();
+	Barrier b;

 	// Run several tasks in parallel
 	for (size_t c = 0; c < count; ++c) {
-		auto bl = b->lock();
 		ostringstream oss;
 		oss << "task #" << c;
+		auto b_hold = b;
 		Task t(
 			oss.str(),
-			[c, &task_done, &mtx, bl]() mutable {
-				// cerr << "Task #" << c << endl;
+			[c, &task_done, &mtx, b_hold]() mutable {
+				// ostringstream oss;
+				// oss << "Task #" << c << endl;
 				unique_lock<mutex> lock(mtx);
+				// cerr << oss.str();
 				task_done.at(c) = true;
-				bl.release();
+				b_hold.release();
 			}
 		);
 		t.run();
 	}

+	// Need completed to go out of local scope so it will release b
+	{
+		Task completed(
+			"Waiting for Barrier",
+			[&mtx, &cv, &done_flag]() {
+				unique_lock<mutex> lock(mtx);
+				// cerr << "Running cv notify" << endl;
+				done_flag = true;
+				cv.notify_all();
+			}
+		);
+		b.insert_task(completed);
+	}
+
 	// Get current status
-	ostringstream oss;
-	TaskMaster::print_queue(oss);
-	TaskMaster::print_workers(oss);
+	// TaskMaster::print_queue(cerr);
+	// TaskMaster::print_workers(cerr);

-	bool done_flag = false;
-
-	Task completed(
-		"Waiting for Barrier",
-		[&mtx, &cv, &done_flag]() {
-			unique_lock<mutex> lock(mtx);
-			// cerr << "Running cv notify" << endl;
-			done_flag = true;
-			cv.notify_all();
-		}
-	);
-	b->insert_task(completed);
-
-	b.reset();
+	// Release our b
+	b.release();

 	while (true) {
 		size_t tasks_done = 0;
@@ -139,7 +143,7 @@ test_barrier(size_t count)
 				++tasks_done;
 			}
 		}
-		// cerr << "Tasks done: " << tasks_done << " done_flag " << done_flag << endl;
+		cerr << "Tasks done: " << tasks_done << " done_flag " << done_flag << endl;
 		if (tasks_done == count && done_flag) {
 			break;
 		}
@@ -153,7 +157,7 @@ void
 test_exclusion(size_t count)
 {
 	mutex only_one;
-	auto excl = make_shared<Exclusion>("test_excl");
+	auto excl = make_shared<Exclusion>();

 	mutex mtx;
 	condition_variable cv;
@@ -174,9 +178,8 @@ test_exclusion(size_t count)
 			[c, &only_one, excl, &lock_success_count, &lock_failure_count, &pings, &tasks_running, &cv, &mtx]() mutable {
 				// cerr << "Task #" << c << endl;
 				(void)c;
-				auto lock = excl->try_lock();
+				auto lock = excl->try_lock(Task::current_task());
 				if (!lock) {
-					excl->insert_task(Task::current_task());
 					++lock_failure_count;
 					return;
 				}
@@ -196,7 +199,7 @@ test_exclusion(size_t count)
 		t.run();
 	}

-	// excl.reset();
+	excl.reset();

 	unique_lock<mutex> lock(mtx);
 	while (tasks_running) {