task: fixes for priority and idle Tasks

Tasks are not allowed to be queued more than once, but it is allowed to queue a Task while it's already running, which means a Task can be executed on two threads in parallel. Tasks detect this and handle it by queueing the Task on its own post-exec queue. That in turn leads to Workers which continually execute the same Task if that Task doesn't create any new Tasks, while other Tasks sit on the Master queue waiting for a Worker to dequeue them.

For idle Tasks, we don't want the Task to be rescheduled immediately. We want the idle Task to execute again after every available Task on both the main and idle queues has been executed.

Fix these by having each Task reschedule itself on the appropriate queue when it finishes executing.

Priority queued Tasks should be executed in priority order across not just one Task's post-exec queue, but the entire local queue of the TaskConsumer. Fix this by moving the sort into either the TaskConsumer that receives a post-exec queue, if there is one, or into the Task that is created to insert the post-exec queue into a TaskConsumer when one becomes available in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
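
The rescheduling rule described in this commit message can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: `MiniTask`, `MiniScheduler`, `enqueue`, `run_one`, and `demo_order` are hypothetical names invented for illustration and are not the crucible `Task` API, and neither worker threads nor priority sorting is modelled. A task that is queued while it is running is only flagged, and when it finishes it goes to the back of its own queue (main or idle), so other waiting tasks run first.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded model; these are NOT the crucible Task classes.
struct MiniTask {
    std::string name;
    bool idle;                   // true: task belongs on the idle queue
    bool queued = false;         // already waiting on a queue
    bool running = false;        // currently executing
    bool run_again = false;      // set when queued while already running
    std::function<void()> body;  // optional work function

    MiniTask(std::string n, bool is_idle) : name(std::move(n)), idle(is_idle) {}
};

struct MiniScheduler {
    std::deque<MiniTask *> main_queue;
    std::deque<MiniTask *> idle_queue;
    std::vector<std::string> log;  // execution order, for inspection

    // Queue a task.  A running task is only flagged for a later rerun;
    // a task that is already waiting is not queued a second time.
    void enqueue(MiniTask &t) {
        if (t.running) {
            t.run_again = true;
            return;
        }
        if (t.queued) {
            return;
        }
        t.queued = true;
        (t.idle ? idle_queue : main_queue).push_back(&t);
    }

    // Run one task.  Idle tasks run only when the main queue is empty.
    bool run_one() {
        std::deque<MiniTask *> &q = main_queue.empty() ? idle_queue : main_queue;
        if (q.empty()) {
            return false;
        }
        MiniTask *t = q.front();
        q.pop_front();
        t->queued = false;
        t->running = true;
        log.push_back(t->name);
        if (t->body) {
            t->body();
        }
        t->running = false;
        // On completion the task reschedules itself at the back of the
        // appropriate queue instead of running again immediately.
        if (t->run_again) {
            t->run_again = false;
            enqueue(*t);
        }
        return true;
    }
};

// Builds three main tasks and one idle task and returns the execution order.
std::vector<std::string> demo_order() {
    MiniScheduler s;
    MiniTask a("A", false), b("B", false), c("C", false), i("I", true);
    int a_runs = 0, i_runs = 0;
    a.body = [&] { if (++a_runs == 1) s.enqueue(a); };  // A requeues itself once
    i.body = [&] {
        if (++i_runs == 1) {
            s.enqueue(i);  // idle task asks to run again
            s.enqueue(b);  // and creates new main-queue work
        }
    };
    s.enqueue(a);
    s.enqueue(b);
    s.enqueue(c);
    s.enqueue(i);
    while (s.run_one()) {
    }
    return s.log;
}
```

In this model, calling `demo_order()` runs `A` a second time only after `B` and `C`, and the idle task `I` runs again only after the main-queue work it created has finished. Under the older behaviour described above, where a task is requeued on its own post-exec queue, `A` would have run twice in a row.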
Revert "roots: use a non-idle task for next_transid"
2025-08-02 05:43:29 +02:00 · 2025-01-15 00:43:25 -05:00 · 2025-01-12 18:48:33 -05:00 · 2025-01-12 15:28:26 -05:00 · 2025-01-12 14:05:44 -05:00 · 2025-01-12 00:35:37 -05:00
18 changed files with 349 additions and 240 deletions
--- a/README.md
+++ b/README.md
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------

-bees is a block-oriented userspace deduplication agent designed for large
-btrfs filesystems.  It is an offline dedupe combined with an incremental
-data scan capability to minimize time data spends on disk from write
-to dedupe.
+bees is a block-oriented userspace deduplication agent designed to scale
+up to large btrfs filesystems.  It is an offline dedupe combined with
+an incremental data scan capability to minimize time data spends on disk
+from write to dedupe.

 Strengths
 ---------

- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Daemon mode - incrementally dedupes new data as it appears
+ * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
- * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
+ * btrfs support - recovers more free space from btrfs than naive dedupers

 Weaknesses
 ----------

 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
- * First run may require temporary disk space for extent reorganization
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * [First run may increase metadata space usage if many snapshots exist](docs/gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------

 * [bees Gotchas](docs/gotchas.md)
- * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](docs/btrfs-other.md)
 * [What to do when something goes wrong](docs/wrong.md)

@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
--- a/docs/btrfs-kernel.md
+++ b/docs/btrfs-kernel.md
@@ -1,31 +1,24 @@
-Recommended Kernel Version for bees
-===================================
+Recommended Linux Kernel Version for bees
+=========================================

-First, a warning that is not specific to bees:
+First, a warning about old Linux kernel versions:

-> **Kernel 5.1, 5.2, and 5.3 should not be used with btrfs due to a
-severe regression that can lead to fatal metadata corruption.**
-This issue is fixed in kernel 5.4.14 and later.
+> **Linux kernel version 5.1, 5.2, and 5.3 should not be used with btrfs
+due to a severe regression that can lead to fatal metadata corruption.**
+This issue is fixed in version 5.4.14 and later.

-**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
-6.0, or 6.1, with recent LTS and -stable updates.**  The latest released
-kernel as of this writing is 6.4.1.
+**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
+6.6, or 6.12 with recent LTS and -stable updates.**  The latest released
+kernel as of this writing is 6.12.9, and the earliest supported LTS
+kernel is 5.4.

-4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
-issues.  Older kernels will be slower (a little slower or a lot slower
-depending on which issues are triggered).  Not all fixes are backported.
-
-Obsolete non-LTS kernels have a variety of unfixed issues and should
-not be used with btrfs.  For details see the table below.
-
-bees requires btrfs kernel API version 4.2 or higher, and does not work
-at all on older kernels.
-
-Some bees features rely on kernel 4.15 to work, and these features will
-not be available on older kernels.  Currently, bees is still usable on
-older kernels with degraded performance or with options disabled, but
-support for older kernels may be removed.
+Some optional bees features use kernel APIs introduced in kernel 4.15
+(extent scan) and 5.6 (`openat2` support).  These bees features are not
+available on older kernels.  Support for older kernels may be removed
+in a future bees release.

+bees will not run at all on kernels before 4.2 due to lack of minimal
+API support.



@@ -71,7 +64,7 @@ These bugs are particularly popular among bees users, though not all are specifi
 | 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later.  Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
 | 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
 | 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
-| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that
+| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that

 "Last bad kernel" refers to that version's last stable update from
 kernel.org.  Distro kernels may backport additional fixes.  Consult
@@ -97,12 +90,12 @@ contains the last committed component of the fix.
 Workarounds for known kernel bugs
 ---------------------------------

-* **Hangs with concurrent `LOGICAL_INO` and dedupe**:  on all
-  kernel versions so far, multiple threads running `LOGICAL_INO`
-  and dedupe ioctls at the same time on the same inodes or extents
+* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**:  on all
+  kernel versions so far, multiple threads running `LOGICAL_INO` and
+  dedupe/clone ioctls at the same time on the same inodes or extents
  can lead to a kernel hang.  The kernel enters an infinite loop in
  `add_all_parents`, where `count` is 0, `ref->count` is 1, and
-  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref).
+  `btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.

  bees has two workarounds for this bug: 1. schedule work so that multiple
  threads do not simultaneously access the same inode or the same extent,
@@ -123,58 +116,32 @@ Workarounds for known kernel bugs

  It is still theoretically possible to trigger the kernel bug when
  running bees at the same time as other dedupers, or other programs
-  that use `LOGICAL_INO` like `btdu`; however, it's extremely difficult
-  to reproduce the bug without closely cooperating threads.
+  that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
+  operation such as `cp` or `mv`; however, it's extremely difficult to
+  reproduce the bug without closely cooperating threads.

-* **Slow backrefs** (aka toxic extents):  Under certain conditions,
-  if the number of references to a single shared extent grows too
-  high, the kernel consumes more and more CPU while also holding locks
-  that delay write access to the filesystem.  bees avoids this bug
-  by measuring the time the kernel spends performing `LOGICAL_INO`
-  operations and permanently blacklisting any extent or hash involved
-  where the kernel starts to get slow.  In the bees log, such blocks
-  are labelled as 'toxic' hash/block addresses.  Toxic extents are
-  rare (about 1 in 100,000 extents become toxic), but toxic extents can
-  become 8 orders of magnitude more expensive to process than the fastest
-  non-toxic extents.  This seems to affect all dedupe agents on btrfs;
-  at this time of writing only bees has a workaround for this bug.
+* **Slow backrefs** (aka toxic extents):  On older kernels, under certain
+  conditions, if the number of references to a single shared extent grows
+  too high, the kernel consumes more and more CPU while also holding
+  locks that delay write access to the filesystem.  This is no longer
+  a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
+  but there are still some remains of earlier workarounds for this issue
+  in bees that have not been fully removed.

-  This workaround is less necessary for kernels 5.4.96, 5.7 and later,
-  though the bees workaround can still be triggered on newer kernels
-  by changes in btrfs since kernel version 5.1.
+  bees avoided this bug by measuring the time the kernel spends performing
+  `LOGICAL_INO` operations and permanently blacklisting any extent or
+  hash involved where the kernel starts to get slow.  In the bees log,
+  such blocks are labelled as 'toxic' hash/block addresses.
+
+  Future bees releases will remove toxic extent detection (it only detects
+  false positives now) and clear all previously saved toxic extent bits.

 * **dedupe breaks `btrfs send` in old kernels**.  The bees option
  `--workaround-btrfs-send` prevents any modification of read-only subvols
-  in order to avoid breaking `btrfs send`.
+  in order to avoid breaking `btrfs send` on kernels before 5.2.

-  This workaround is no longer necessary to avoid kernel crashes
-  and send performance failure on kernel 4.9.207, 4.14.159, 4.19.90,
-  5.3.17, 5.4.4, 5.5 and later; however, some conflict between send
-  and dedupe still remains, so the workaround is still useful.
+  This workaround is no longer necessary to avoid kernel crashes and
+  send performance failure on kernel 5.4.4 and later.  bees will pause
+  dedupe until the send is finished on current kernels.

  `btrfs receive` is not and has never been affected by this issue.
-
-Unfixed kernel bugs
-------------------
-
-* **The kernel does not permit `btrfs send` and dedupe to run at the
-  same time**.  Recent kernels no longer crash, but now refuse one
-  operation with an error if the other operation was already running.
-
-  bees has not been updated to handle the new dedupe behavior optimally.
-  Optimal behavior is to defer dedupe operations when send is detected,
-  and resume after the send is finished.  Current bees behavior is to
-  complain loudly about each individual dedupe failure in log messages,
-  and abandon duplicate data references in the snapshot that send is
-  processing.  A future bees version shall have better handling for
-  this situation.
-
-  Workaround:  send `SIGSTOP` to bees, or terminate the bees process,
-  before running `btrfs send`.
-
-  This workaround is not strictly required if snapshot is deleted after
-  sending.  In that case, any duplicate data blocks that were not removed
-  by dedupe will be removed by snapshot delete instead.  The workaround
-  still saves some IO.
-
-  `btrfs receive` is not affected by this issue.
--- a/docs/btrfs-other.md
+++ b/docs/btrfs-other.md
@@ -3,40 +3,34 @@ Good Btrfs Feature Interactions

 bees has been tested in combination with the following:

-* btrfs compression (zlib, lzo, zstd), mixtures of compressed and uncompressed extents
+* btrfs compression (zlib, lzo, zstd)
 * PREALLOC extents (unconditionally replaced with holes)
 * HOLE extents and btrfs no-holes feature
-* Other deduplicators, reflink copies (though bees may decide to redo their work)
-* btrfs snapshots and non-snapshot subvols (RW and RO)
+* Other deduplicators (`duperemove`, `jdupes`)
+* Reflink copies (modern coreutils `cp` and `mv`)
 * Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
-* All btrfs RAID profiles
-* IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
-* Filesystems mounted with or without the `flushoncommit` option
+* All btrfs RAID profiles:  single, dup, raid0, raid1, raid10, raid1c3, raid1c4, raid5, raid6
+* IO errors during dedupe (affected extents are skipped)
 * 4K filesystem data block size / clone alignment
 * 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
 * Large files (kernel 5.4 or later strongly recommended)
-* Filesystems up to 90T+ bytes, 1000M+ files
+* Filesystem data sizes up to 100T+ bytes, 1000M+ files
+* `open(O_DIRECT)` (seems to work as well--or as poorly--with bees as with any other btrfs feature)
+* btrfs-convert from ext2/3/4
+* btrfs `autodefrag` mount option
+* btrfs balance (data balances cause rescan of relocated data)
+* btrfs block-group-tree
+* btrfs `flushoncommit` and `noflushoncommit` mount options
+* btrfs mixed block groups
+* btrfs `nodatacow`/`nodatasum` inode attribute or mount option (bees skips all nodatasum files)
+* btrfs qgroups and quota support (_not_ squotas)
 * btrfs receive
-* btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
-* open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)
-* lvm dm-cache, writecache
+* btrfs scrub
+* btrfs send (dedupe pauses automatically, kernel 5.4 or later required)
+* btrfs snapshot, non-snapshot subvols (RW and RO), snapshot delete

-Bad Btrfs Feature Interactions
------------------------------
-
-bees has been tested in combination with the following, and various problems are known:
-
-* btrfs send:  there are bugs in `btrfs send` that can be triggered by
-  bees on old kernels.  The [`--workaround-btrfs-send` option](options.md)
-  works around this issue by preventing bees from modifying read-only
-  snapshots.
-
-* btrfs qgroups:  very slow, sometimes hangs...and it's even worse when
-  bees is running.
-
-* btrfs autodefrag mount option:  bees cannot distinguish autodefrag
-  activity from normal filesystem activity, and may try to undo the
-  autodefrag if duplicate copies of the defragmented data exist.
+**Note:** some btrfs features have minimum kernel versions which are
+higher than the minimum kernel version for bees.

 Untested Btrfs Feature Interactions
 -----------------------------------
@@ -45,10 +39,6 @@ bees has not been tested with the following, and undesirable interactions may oc

 * Non-4K filesystem data block size (should work if recompiled)
 * Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
-* btrfs seed filesystems (no particular reason it wouldn't work, but no one has reported trying)
-* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe, encryption, extent tree v2)
-* btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
-* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
+* btrfs seed filesystems, raid-stripe-tree, squotas (no particular reason these wouldn't work, but no one has reported trying)
+* btrfs out-of-tree kernel patches (e.g. encryption, extent tree v2)
 * Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
-* bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
-* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper
--- a/docs/config.md
+++ b/docs/config.md
@@ -26,11 +26,7 @@ Here are some numbers to estimate appropriate hash table sizes:
 Notes:

 * If the hash table is too large, no extra dedupe efficiency is
-obtained, and the extra space wastes RAM.  If the hash table contains
-more block records than there are blocks in the filesystem, the extra
-space can slow bees down.  A table that is too large prevents obsolete
-data from being evicted, so bees wastes time looking for matching data
-that is no longer present on the filesystem.
+obtained, and the extra space wastes RAM.

 * If the hash table is too small, bees extrapolates from matching
 blocks to find matching adjacent blocks in the filesystem that have been
@@ -59,19 +55,19 @@ patterns on dedupe effectiveness without performing deep inspection of
 both the filesystem data and its structure--a task that is as expensive
 as performing the deduplication.

-* **Compression** on the filesystem reduces the average extent length
-compared to uncompressed filesystems.  The maximum compressed extent
-length on btrfs is 128KB, while the maximum uncompressed extent length
-is 128MB.  Longer extents decrease the optimum hash table size while
-shorter extents increase the optimum hash table size because the
-probability of a hash table entry being present (i.e. unevicted) in
-each extent is proportional to the extent length.
+* **Compression** in files reduces the average extent length compared
+to uncompressed files.  The maximum compressed extent length on
+btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
+Longer extents decrease the optimum hash table size while shorter extents
+increase the optimum hash table size, because the probability of a hash
+table entry being present (i.e. unevicted) in each extent is proportional
+to the extent length.

   As a rule of thumb, the optimal hash table size for a compressed
 filesystem is 2-4x larger than the optimal hash table size for the same
-data on an uncompressed filesystem.  Dedupe efficiency falls dramatically
-with hash tables smaller than 128MB/TB as the average dedupe extent size
-is larger than the largest possible compressed extent size (128KB).
+data on an uncompressed filesystem.  Dedupe efficiency falls rapidly with
+hash tables smaller than 128MB/TB as the average dedupe extent size is
+larger than the largest possible compressed extent size (128KB).

 * **Short writes or fragmentation** also shorten the average extent
 length and increase optimum hash table size.  If a database writes to
@@ -115,7 +111,6 @@ Extent scan mode:
 * Works with 4.15 and later kernels.
 * Can estimate progress and provide an ETA.
 * Can optimize scanning order to dedupe large extents first.
- * Cannot avoid modifying read-only subvols.
 * Can keep up with frequent creation and deletion of snapshots.

 Subvol scan modes:
@@ -123,8 +118,7 @@ Subvol scan modes:
 * Work with 4.14 and earlier kernels.
 * Cannot estimate or report progress.
 * Cannot optimize scanning order by extent size.
- * Can avoid modifying read-only subvols (for `btrfs send` workaround).
- * Have problems keeping up with snapshots created during a scan.
+ * Have problems keeping up with multiple snapshots created during a scan.

 The default scan mode is 4, "extent".

@@ -212,7 +206,7 @@ Extent scan mode
 Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
 Extent scan mode reads each extent once, regardless of the number of
 reflinks or snapshots.  It adapts to the creation of new snapshots
-immediately, without having to revisit old data.
+and reflinks immediately, without having to revisit old data.

 In the extent scan mode, extents are separated into multiple size tiers
 to prioritize large extents over small ones.  Deduping large extents
@@ -268,17 +262,54 @@ send` in extent scan mode, and restart bees after the `send` is complete.
 Threads and load management
 ---------------------------

-By default, bees creates one worker thread for each CPU detected.
-These threads then perform scanning and dedupe operations.  The number of
-worker threads can be set with the [`--thread-count` and `--thread-factor`
-options](options.md).
+By default, bees creates one worker thread for each CPU detected.  These
+threads then perform scanning and dedupe operations.  bees attempts to
+maximize the amount of productive work each thread does, until either the
+threads are all continuously busy, or there is no remaining work to do.

-If desired, bees can automatically increase or decrease the number
-of worker threads in response to system load.  This reduces impact on
-the rest of the system by pausing bees when other CPU and IO intensive
-loads are active on the system, and resumes bees when the other loads
-are inactive.  This is configured with the [`--loadavg-target` and
-`--thread-min` options](options.md).
+In many cases it is not desirable to continually run bees at maximum
+performance.  Maximum performance is not necessary if bees can dedupe
+new data faster than it appears on the filesystem.  If it only takes
+bees 10 minutes per day to dedupe all new data on a filesystem, then
+bees doesn't need to run for more than 10 minutes per day.
+
+bees supports a number of options for reducing system load:
+
+ * Run bees for a few hours per day, at an off-peak time (i.e. during
+ a maintenance window), instead of running bees continuously.  Any data
+ added to the filesystem while bees is not running will be scanned when
+ bees restarts.  At the end of the maintenance window, terminate the
+ bees process with SIGTERM to write the hash table and scan position
+ for the next maintenance window.
+
+ * Temporarily pause bees operation by sending the bees process SIGUSR1,
+ and resume operation with SIGUSR2.  This is preferable to freezing
+ and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
+ signals, because it allows bees to close open file handles that would
+ otherwise prevent those files from being deleted while bees is frozen.
+
+ * Reduce the number of worker threads with the [`--thread-count` or
+`--thread-factor` options](options.md).  This simply leaves CPU cores
+ idle so that other applications on the host can use them, or to save
+ power.
+
+ * Allow bees to automatically track system load and increase or decrease
+ the number of threads to reach a target system load.  This reduces
+ impact on the rest of the system by pausing bees when other CPU and IO
+ intensive loads are active on the system, and resumes bees when the other
+ loads are inactive.  This is configured with the [`--loadavg-target`
+ and `--thread-min` options](options.md).
+
+ * Allow bees to self-throttle operations that enqueue delayed work
+ within btrfs.  These operations are not well controlled by Linux
+ features such as process priority or IO priority or IO rate-limiting,
+ because the enqueued work is submitted to btrfs several seconds before
+ btrfs performs the work.  By the time btrfs performs the work, it's too
+ late for external throttling to be effective.  The [`--throttle-factor`
+ option](options.md) tracks how long it takes btrfs to complete queued
+ operations, and reduces bees's queued work submission rate to match
+ btrfs's queued work completion rate (or a fraction thereof, to reduce
+ system load).

 Log verbosity
 -------------
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
 About bees
 ----------

-bees is a block-oriented userspace deduplication agent designed for large
-btrfs filesystems.  It is an offline dedupe combined with an incremental
-data scan capability to minimize time data spends on disk from write
-to dedupe.
+bees is a block-oriented userspace deduplication agent designed to scale
+up to large btrfs filesystems.  It is an offline dedupe combined with
+an incremental data scan capability to minimize time data spends on disk
+from write to dedupe.

 Strengths
 ---------

- * Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
- * Daemon incrementally dedupes new data using btrfs tree search
+ * Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
+ * Daemon mode - incrementally dedupes new data as it appears
+ * Largest extents first - recover more free space during fixed maintenance windows
 * Works with btrfs compression - dedupe any combination of compressed and uncompressed files
- * Works around btrfs filesystem structure to free more disk space
+ * Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
 * Persistent hash table for rapid restart after shutdown
- * Whole-filesystem dedupe - including snapshots
 * Constant hash table size - no increased RAM usage if data set becomes larger
 * Works on live data - no scheduled downtime required
- * Automatic self-throttling based on system load
+ * Automatic self-throttling - reduces system load
+ * btrfs support - recovers more free space from btrfs than naive dedupers

 Weaknesses
 ----------

 * Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
- * Requires root privilege (or `CAP_SYS_ADMIN`)
- * First run may require temporary disk space for extent reorganization
+ * Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
 * [First run may increase metadata space usage if many snapshots exist](gotchas.md)
 * Constant hash table size - no decreased RAM usage if data set becomes smaller
 * btrfs only
@@ -46,7 +46,7 @@ Recommended Reading
 -------------------

 * [bees Gotchas](gotchas.md)
- * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING
+ * [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
 * [bees vs. other btrfs features](btrfs-other.md)
 * [What to do when something goes wrong](wrong.md)

@@ -69,6 +69,6 @@ You can also use Github:
 Copyright & License
 -------------------

-Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
+Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.

 GPL (version 3 or later).
--- a/docs/options.md
+++ b/docs/options.md
@@ -84,19 +84,22 @@

 * `--workaround-btrfs-send` or `-a`

+ _This option is obsolete and should not be used any more._
+
 Pretend that read-only snapshots are empty and silently discard any
-request to dedupe files referenced through them.  This is a workaround for
-[problems with the kernel implementation of `btrfs send` and `btrfs send
+request to dedupe files referenced through them.  This is a workaround
+for [problems with old kernels running `btrfs send` and `btrfs send
 -p`](btrfs-kernel.md) which make these btrfs features unusable with bees.

- This option should be used to avoid breaking `btrfs send` on the same
-filesystem.
+ This option was used to avoid breaking `btrfs send` on old kernels.
+ The affected kernels are now too old to be recommended for use with bees.
+
+ bees now waits for `btrfs send` to finish.  There is no need for an
+ option to enable this.

 **Note:** There is a _significant_ space tradeoff when using this option:
 it is likely no space will be recovered--and possibly significant extra
-space used--until the read-only snapshots are deleted.  On the other
-hand, if snapshots are rotated frequently then bees will spend less time
-scanning them.
+space used--until the read-only snapshots are deleted.

 ## Logging options

--- a/docs/wrong.md
+++ b/docs/wrong.md
@@ -4,16 +4,13 @@ What to do when something goes wrong with bees
 Hangs and excessive slowness
 ----------------------------

-### Are you using qgroups or autodefrag?
-
-  Read about [bad btrfs feature interactions](btrfs-other.md).
-
 ### Use load-throttling options

  If bees is just more aggressive than you would like, consider using
  [load throttling options](options.md).  These are usually more effective
  than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
-  certainly use those too).
+  certainly use those too) because they limit work that bees queues up
+  for later execution inside btrfs.

 ### Check `$BEESSTATUS`

@@ -52,10 +49,6 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li

 Thread names of note:

- * `crawl_12345`: scan/dedupe worker threads (the number is the subvol
-   ID which the thread is currently working on).  These threads appear
-   and disappear from the status dynamically according to the requirements
-   of the work queue and loadavg throttling.
 * `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
 * `crawl_master`: task that finds new extents in the filesystem and populates the work queue
 * `crawl_transid`: btrfs transid (generation number) tracker and polling thread
@@ -64,6 +57,13 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
 * `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
 * `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly

+Most other threads have names that are derived from the current dedupe
+task that they are executing:
+
+ * `ref_205ad76b1000_24K_50`:  extent scan performing dedupe of btrfs extent bytenr `205ad76b1000`, which is 24 KiB long and has 50 references
+ * `extent_250_32M_16E`:  extent scan searching for extents between 32 MiB + 1 and 16 EiB bytes long, tracking scan position in virtual subvol `250`.
+ * `crawl_378_18916`:  subvol scan searching for extent refs in subvol `378`, inode `18916`.
+
 ### Dump kernel stacks of hung processes

 Check the kernel stacks of all blocked kernel processes:
@@ -91,7 +91,7 @@ bees Crashes
        (gdb) thread apply all bt full

  The last line generates megabytes of output and will often crash gdb.
-  This is OK, submit whatever output gdb can produce.
+  Submit whatever output gdb can produce.

  **Note that this output may include filenames or data from your
  filesystem.**
@@ -160,8 +160,7 @@ Kernel crashes, corruption, and filesystem damage
 -------------------------------------------------

 bees doesn't do anything that _should_ cause corruption or data loss;
-however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
-with some Linux block device layers](btrfs-other.md), so corruption is
+however, [btrfs has kernel bugs](btrfs-kernel.md), so corruption is
 not impossible.

 Issues with the btrfs filesystem kernel code or other block device layers
--- a/include/crucible/lockset.h
+++ b/include/crucible/lockset.h
@@ -117,7 +117,7 @@ namespace crucible {
 		while (full() || locked(name)) {
 			m_condvar.wait(lock);
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK0(runtime_error, rv.second);
 	}

@@ -129,7 +129,7 @@ namespace crucible {
 		if (full() || locked(name)) {
 			return false;
 		}
-		auto rv = m_set.insert(make_pair(name, crucible::gettid()));
+		auto rv = m_set.insert(make_pair(name, gettid()));
 		THROW_CHECK1(runtime_error, name, rv.second);
 		return true;
 	}
--- a/include/crucible/openat2.h
+++ b/include/crucible/openat2.h
@@ -0,0 +1,17 @@
+#ifndef CRUCIBLE_OPENAT2_H
+#define CRUCIBLE_OPENAT2_H
+
+#include <linux/openat2.h>
+
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+extern "C" {
+
+/// Weak symbol to support libc with no syscall wrapper
+int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size) throw();
+
+};
+
+#endif // CRUCIBLE_OPENAT2_H
--- a/include/crucible/process.h
+++ b/include/crucible/process.h
@@ -10,6 +10,10 @@
 #include <sys/wait.h>
 #include <unistd.h>

+extern "C" {
+	pid_t gettid() throw();
+};
+
 namespace crucible {
 	using namespace std;

@@ -73,7 +77,6 @@ namespace crucible {

 	typedef ResourceHandle<Process::id, Process> Pid;

-	pid_t gettid();
 	double getloadavg1();
 	double getloadavg5();
 	double getloadavg15();
--- a/include/crucible/task.h
+++ b/include/crucible/task.h
@@ -47,6 +47,10 @@ namespace crucible {
 		/// been destroyed.
 		void append(const Task &task) const;

+		/// Schedule Task to run after this Task has run or
+		/// been destroyed, in Task ID order.
+		void insert(const Task &task) const;
+
 		/// Describe Task as text.
 		string title() const;

@@ -172,9 +176,6 @@ namespace crucible {
 		/// objects it holds, and exit its Task function.
 		ExclusionLock try_lock(const Task &task);

-		/// Execute Task when Exclusion is unlocked (possibly
-		/// immediately).
-		void insert_task(const Task &t);
 	};

 	/// Wrapper around pthread_setname_np which handles length limits
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -14,6 +14,7 @@ CRUCIBLE_OBJS = \
 	fs.o \
 	multilock.o \
 	ntoa.o \
+	openat2.o \
 	path.o \
 	process.o \
 	string.o \
--- a/lib/chatter.cc
+++ b/lib/chatter.cc
@@ -76,7 +76,7 @@ namespace crucible {
 			DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));

 			header_stream << buf;
-			header_stream << " " << getpid() << "." << crucible::gettid();
+			header_stream << " " << getpid() << "." << gettid();
 			if (add_prefix_level) {
 				header_stream << "<" << m_loglevel << ">";
 			}
@@ -88,7 +88,7 @@ namespace crucible {
 				header_stream << "<" << m_loglevel << ">";
 			}
 			header_stream << (m_name.empty() ? "thread" : m_name);
-			header_stream << "[" << crucible::gettid() << "]";
+			header_stream << "[" << gettid() << "]";
 		}

 		header_stream << ": ";
--- a/lib/openat2.cc
+++ b/lib/openat2.cc
@@ -0,0 +1,13 @@
+#include "crucible/openat2.h"
+
+extern "C" {
+
+int
+__attribute__((weak))
+openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
+throw()
+{
+	return syscall(SYS_openat2, dirfd, pathname, how, size);
+}
+
+};
--- a/lib/process.cc
+++ b/lib/process.cc
@@ -7,13 +7,18 @@
 #include <cstdlib>
 #include <utility>

-// for gettid()
-#ifndef _GNU_SOURCE
-#define _GNU_SOURCE
-#endif
 #include <unistd.h>
 #include <sys/syscall.h>

+extern "C" {
+	pid_t
+	__attribute__((weak))
+	gettid() throw()
+	{
+		return syscall(SYS_gettid);
+	}
+};
+
 namespace crucible {
 	using namespace std;

@@ -111,12 +116,6 @@ namespace crucible {
 		}
 	}

-	pid_t
-	gettid()
-	{
-		return syscall(SYS_gettid);
-	}
-
 	double
 	getloadavg1()
 	{
--- a/lib/task.cc
+++ b/lib/task.cc
@@ -76,13 +76,24 @@ namespace crucible {
 		/// Tasks to be executed after the current task is executed
 		list<TaskStatePtr>			m_post_exec_queue;

-		/// Set by run() and append().  Cleared by exec().
+		/// Set by run(), append(), and insert().  Cleared by exec().
 		bool					m_run_now = false;

+		/// Set by insert().  Cleared by exec() and destructor.
+		bool					m_sort_queue = false;
+
 		/// Set when task starts execution by exec().
 		/// Cleared when exec() ends.
 		bool					m_is_running = false;

+		/// Set when task is queued while already running.
+		/// Cleared when task is requeued.
+		bool					m_run_again = false;
+
+		/// Set when task is queued as idle task while already running.
+		/// Cleared when task is queued as non-idle task.
+		bool					m_idle = false;
+
 		/// Sequential identifier for next task
 		static atomic<TaskId>			s_next_id;

@@ -107,7 +118,7 @@ namespace crucible {
 		static void clear_queue(TaskQueue &tq);

 		/// Rescue any TaskQueue, not just this one.
-		static void rescue_queue(TaskQueue &tq);
+		static void rescue_queue(TaskQueue &tq, const bool sort_queue);

 		TaskState &operator=(const TaskState &) = delete;
 		TaskState(const TaskState &) = delete;
@@ -142,6 +153,10 @@ namespace crucible {
 		/// or is destroyed.
 		void append(const TaskStatePtr &task);

+		/// Queue task to execute after current task finishes executing
+		/// or is destroyed, in task ID order.
+		void insert(const TaskStatePtr &task);
+
 		/// How masy Tasks are there?  Good for catching leaks
 		static size_t instance_count();
 	};
@@ -219,16 +234,21 @@ namespace crucible {
 	static auto s_tms = make_shared<TaskMasterState>();

 	void
-	TaskState::rescue_queue(TaskQueue &queue)
+	TaskState::rescue_queue(TaskQueue &queue, const bool sort_queue)
 	{
 		if (queue.empty()) {
 			return;
 		}
-		const auto tlcc = tl_current_consumer;
+		const auto &tlcc = tl_current_consumer;
 		if (tlcc) {
 			// We are executing under a TaskConsumer, splice our post-exec queue at front.
 			// No locks needed because we are using only thread-local objects.
 			tlcc->m_local_queue.splice(tlcc->m_local_queue.begin(), queue);
+			if (sort_queue) {
+				tlcc->m_local_queue.sort([&](const TaskStatePtr &a, const TaskStatePtr &b) {
+					return a->m_id < b->m_id;
+				});
+			}
 		} else {
 			// We are not executing under a TaskConsumer.
 			// If there is only one task, then just insert it at the front of the queue.
@@ -239,6 +259,8 @@ namespace crucible {
 				// then push it to the front of the global queue using normal locking methods.
 				TaskStatePtr rescue_task(make_shared<TaskState>("rescue_task", [](){}));
 				swap(rescue_task->m_post_exec_queue, queue);
+				// Do the sort--once--when a new Consumer has picked up the Task
+				rescue_task->m_sort_queue = sort_queue;
 				TaskQueue tq_one { rescue_task };
 				TaskMasterState::push_front(tq_one);
 			}
@@ -251,7 +273,8 @@ namespace crucible {
 		--s_instance_count;
 		unique_lock<mutex> lock(m_mutex);
 		// If any dependent Tasks were appended since the last exec, run them now
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
+		// No need to clear m_sort_queue here, it won't exist soon
 	}

 	TaskState::TaskState(string title, function<void()> exec_fn) :
@@ -310,6 +333,24 @@ namespace crucible {
 			task->m_run_now = true;
 			append_nolock(task);
 		}
+		task->m_idle = false;
+	}
+
+	void
+	TaskState::insert(const TaskStatePtr &task)
+	{
+		THROW_CHECK0(invalid_argument, task);
+		THROW_CHECK2(invalid_argument, m_id, task->m_id, m_id != task->m_id);
+		PairLock lock(m_mutex, task->m_mutex);
+		if (!task->m_run_now) {
+			task->m_run_now = true;
+			// Move the task and its post-exec queue to follow this task,
+			// and request a sort of the flattened list.
+			m_sort_queue = true;
+			m_post_exec_queue.push_back(task);
+			m_post_exec_queue.splice(m_post_exec_queue.end(), task->m_post_exec_queue);
+		}
+		task->m_idle = false;
 	}

 	void
@@ -320,7 +361,7 @@ namespace crucible {

 		unique_lock<mutex> lock(m_mutex);
 		if (m_is_running) {
-			append_nolock(shared_from_this());
+			m_run_again = true;
 			return;
 		} else {
 			m_run_now = false;
@@ -344,8 +385,20 @@ namespace crucible {
 		swap(this_task, tl_current_task);
 		m_is_running = false;

+		if (m_run_again) {
+			m_run_again = false;
+			if (m_idle) {
+				// All the way back to the end of the line
+				TaskMasterState::push_back_idle(shared_from_this());
+			} else {
+				// Insert after any dependents waiting for this Task
+				m_post_exec_queue.push_back(shared_from_this());
+			}
+		}
+
 		// Splice task post_exec queue at front of local queue
-		TaskState::rescue_queue(m_post_exec_queue);
+		TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
+		m_sort_queue = false;
 	}

 	string
@@ -365,22 +418,32 @@ namespace crucible {
 	TaskState::run()
 	{
 		unique_lock<mutex> lock(m_mutex);
+		m_idle = false;
 		if (m_run_now) {
 			return;
 		}
 		m_run_now = true;
-		TaskMasterState::push_back(shared_from_this());
+		if (m_is_running) {
+			m_run_again = true;
+		} else {
+			TaskMasterState::push_back(shared_from_this());
+		}
 	}

 	void
 	TaskState::idle()
 	{
 		unique_lock<mutex> lock(m_mutex);
+		m_idle = true;
 		if (m_run_now) {
 			return;
 		}
 		m_run_now = true;
-		TaskMasterState::push_back_idle(shared_from_this());
+		if (m_is_running) {
+			m_run_again = true;
+		} else {
+			TaskMasterState::push_back_idle(shared_from_this());
+		}
 	}

 	TaskMasterState::TaskMasterState(size_t thread_max) :
@@ -740,6 +803,14 @@ namespace crucible {
 		m_task_state->append(that.m_task_state);
 	}

+	void
+	Task::insert(const Task &that) const
+	{
+		THROW_CHECK0(runtime_error, m_task_state);
+		THROW_CHECK0(runtime_error, that);
+		m_task_state->insert(that.m_task_state);
+	}
+
 	Task
 	Task::current_task()
 	{
@@ -854,11 +925,13 @@ namespace crucible {
 		swap(this_consumer, tl_current_consumer);
 		assert(!tl_current_consumer);

-		// Release lock to rescue queue (may attempt to queue a new task at TaskMaster).
-		// rescue_queue normally sends tasks to the local queue of the current TaskConsumer thread,
-		// but we just disconnected ourselves from that.
+		// Release lock to rescue queue (may attempt to queue a
+		// new task at TaskMaster).  rescue_queue normally sends
+		// tasks to the local queue of the current TaskConsumer
+		// thread, but we just disconnected ourselves from that.
+		// No sorting here because this is not a TaskState.
 		lock.unlock();
-		TaskState::rescue_queue(m_local_queue);
+		TaskState::rescue_queue(m_local_queue, false);

 		// Hold lock so we can erase ourselves
 		lock.lock();
@@ -936,21 +1009,6 @@ namespace crucible {
 		m_owner.reset();
 	}

-	void
-	Exclusion::insert_task(const Task &task)
-	{
-		unique_lock<mutex> lock(m_mutex);
-		const auto sp = m_owner.lock();
-		lock.unlock();
-		if (sp) {
-			// If Exclusion is locked then queue task for release;
-			sp->append(task);
-		} else {
-			// otherwise, run the inserted task immediately
-			task.run();
-		}
-	}
-
 	ExclusionLock
 	Exclusion::try_lock(const Task &task)
 	{
@@ -958,7 +1016,7 @@ namespace crucible {
 		const auto sp = m_owner.lock();
 		if (sp) {
 			if (task) {
-				sp->append(task);
+				sp->insert(task);
 			}
 			return ExclusionLock();
 		} else {
--- a/src/bees-roots.cc
+++ b/src/bees-roots.cc
@@ -3,6 +3,7 @@
 #include "crucible/btrfs-tree.h"
 #include "crucible/cache.h"
 #include "crucible/ntoa.h"
+#include "crucible/openat2.h"
 #include "crucible/string.h"
 #include "crucible/table.h"
 #include "crucible/task.h"
@@ -130,7 +131,7 @@ BeesScanMode::start_scan()
 			st->scan();
 		});
 	}
-	m_scan_task.run();
+	m_scan_task.idle();
 }

 bool
@@ -768,7 +769,7 @@ BeesScanModeExtent::scan()

 	// Good to go, start everything running
 	for (const auto &i : task_map_copy) {
-		i.second.run();
+		i.second.idle();
 	}
 }

@@ -901,7 +902,7 @@ BeesScanModeExtent::map_next_extent(uint64_t const subvol)
 				<< " time " << crawl_time << " subvol " << subvol);
 		}

-		// We did something!  Get in line to run again (but don't preempt work already queued)
+		// We did something!  Get in line to run again
 		Task::current_task().idle();
 		return;
 	}
@@ -1024,6 +1025,7 @@ BeesScanModeExtent::next_transid(const CrawlMap &crawl_map_unused)
 	});
 	const auto dash_fill = Table::Fill('-');
 	eta.insert_row(1, vector<Table::Content>(eta.cols().size(), dash_fill));
+	const auto now = time(NULL);
 	for (const auto &i : s_magic_crawl_map) {
 		const auto &subvol = i.first;
 		const auto &magic = i.second;
@@ -1063,7 +1065,6 @@ BeesScanModeExtent::next_transid(const CrawlMap &crawl_map_unused)
 		}
 		const auto bytenr_offset = min(bi_last_bytenr, max(bytenr, bi.first_bytenr)) - bi.first_bytenr + bi.first_total;
 		const auto bytenr_norm = bytenr_offset / double(fs_size);
-		const auto now = time(NULL);
 		const auto time_so_far = now - min(now, this_state.m_started);
 		const string start_stamp = strf_localtime(this_state.m_started);
 		string eta_stamp = "-";
@@ -1101,8 +1102,8 @@ BeesScanModeExtent::next_transid(const CrawlMap &crawl_map_unused)
 		Table::Text("gen_now"),
 		Table::Number(m_roots->transid_max()),
 		Table::Text(""),
-		Table::Text(""),
-		Table::Text(""),
+		Table::Text("updated"),
+		Table::Text(strf_localtime(now)),
 	});
 	eta.left("");
 	eta.mid(" ");
@@ -1758,6 +1759,32 @@ BeesRoots::stop_wait()
 	BEESLOGDEBUG("BeesRoots stopped");
 }

+static
+Fd
+bees_openat(int const parent_fd, const char *const pathname, uint64_t const flags)
+{
+	// Never O_CREAT so we don't need a mode argument
+	THROW_CHECK1(invalid_argument, flags, (flags & O_CREAT) == 0);
+
+	// Try openat2 if the kernel has it
+	static bool can_openat2 = true;
+	if (can_openat2) {
+		open_how how {
+			.flags = flags,
+			.resolve = RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS | RESOLVE_NO_XDEV,
+		};
+		const auto rv = openat2(parent_fd, pathname, &how, sizeof(open_how));
+		if (rv == -1 && errno == ENOSYS) {
+			can_openat2 = false;
+		} else {
+			return Fd(rv);
+		}
+	}
+
+	// No kernel support, use openat instead
+	return Fd(openat(parent_fd, pathname, flags));
+}
+
 Fd
 BeesRoots::open_root_nocache(uint64_t rootid)
 {
@@ -1820,7 +1847,7 @@ BeesRoots::open_root_nocache(uint64_t rootid)
 					}
 					// Theoretically there is only one, so don't bother looping.
 					BEESTRACE("dirid " << dirid << " path " << ino.m_paths.at(0));
-					parent_fd = openat(parent_fd, ino.m_paths.at(0).c_str(), FLAGS_OPEN_DIR);
+					parent_fd = bees_openat(parent_fd, ino.m_paths.at(0).c_str(), FLAGS_OPEN_DIR);
 					if (!parent_fd) {
 						BEESLOGTRACE("no parent_fd from dirid");
 						BEESCOUNT(root_parent_path_open_fail);
@@ -1829,7 +1856,7 @@ BeesRoots::open_root_nocache(uint64_t rootid)
 				}
 				// BEESLOG("openat(" << name_fd(parent_fd) << ", " << name << ")");
 				BEESTRACE("openat(" << name_fd(parent_fd) << ", " << name << ")");
-				Fd rv = openat(parent_fd, name.c_str(), FLAGS_OPEN_DIR);
+				Fd rv = bees_openat(parent_fd, name.c_str(), FLAGS_OPEN_DIR);
 				if (!rv) {
 					BEESLOGTRACE("open failed for name " << name << ": " << strerror(errno));
 					BEESCOUNT(root_open_fail);
@@ -1975,7 +2002,7 @@ BeesRoots::open_root_ino_nocache(uint64_t root, uint64_t ino)
 		// opening in write mode, and if we do open in write mode,
 		// we can't exec the file while we have it open.
 		const char *fp_cstr = file_path.c_str();
-		rv = openat(root_fd, fp_cstr, FLAGS_OPEN_FILE);
+		rv = bees_openat(root_fd, fp_cstr, FLAGS_OPEN_FILE);
 		if (!rv) {
 			// errno == ENOENT is the most common error case.
 			// No need to report it.
--- a/src/bees-trace.cc
+++ b/src/bees-trace.cc
@@ -91,9 +91,9 @@ BeesNote::~BeesNote()
 	tl_next = m_prev;
 	unique_lock<mutex> lock(s_mutex);
 	if (tl_next) {
-		s_status[crucible::gettid()] = tl_next;
+		s_status[gettid()] = tl_next;
 	} else {
-		s_status.erase(crucible::gettid());
+		s_status.erase(gettid());
 	}
 }

@@ -104,7 +104,7 @@ BeesNote::BeesNote(function<void(ostream &os)> f) :
 	m_prev = tl_next;
 	tl_next = this;
 	unique_lock<mutex> lock(s_mutex);
-	s_status[crucible::gettid()] = tl_next;
+	s_status[gettid()] = tl_next;
 }

 void
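
The rescheduling change in `lib/task.cc` can be illustrated with a small single-threaded model. This is a minimal sketch, not the crucible implementation: `SketchTask`, `SketchQueues`, and `drain` are hypothetical names, there is no locking, and the non-idle case is simplified to push onto the back of the main queue (the patch instead appends the Task to its own post-exec queue). It shows only the `m_run_again` idea: a Task that is queued while it is running sets a flag and requeues itself when it finishes, and an idle Task goes to the back of the idle queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded model; these names are not part of crucible.
struct SketchTask;
using SketchTaskPtr = std::shared_ptr<SketchTask>;

struct SketchQueues {
    std::deque<SketchTaskPtr> main_queue;  // served first
    std::deque<SketchTaskPtr> idle_queue;  // served only when main_queue is empty
};

struct SketchTask : std::enable_shared_from_this<SketchTask> {
    std::string name;
    SketchQueues &queues;
    std::function<void(SketchTask &)> body;
    bool run_now = false;     // already queued, do not queue twice
    bool is_running = false;  // set for the duration of exec()
    bool run_again = false;   // queued while running, requeue on exit
    bool is_idle = false;     // last request came from idle()

    SketchTask(std::string n, SketchQueues &q, std::function<void(SketchTask &)> b)
        : name(std::move(n)), queues(q), body(std::move(b)) {}

    void run()  { schedule(false); }
    void idle() { schedule(true); }

    void exec() {
        assert(!is_running);  // a Task never runs twice in parallel
        run_now = false;
        is_running = true;
        body(*this);
        is_running = false;
        if (run_again) {
            // Requeue only after the body has finished
            run_again = false;
            queue_for(is_idle).push_back(shared_from_this());
        }
    }

private:
    std::deque<SketchTaskPtr> &queue_for(bool idle_requested) {
        return idle_requested ? queues.idle_queue : queues.main_queue;
    }

    void schedule(bool idle_requested) {
        is_idle = idle_requested;
        if (run_now) return;
        run_now = true;
        if (is_running) {
            run_again = true;  // defer queueing until exec() ends
        } else {
            queue_for(idle_requested).push_back(shared_from_this());
        }
    }
};

// Execute up to max_steps Tasks, main queue first, and record the order.
std::vector<std::string> drain(SketchQueues &q, std::size_t max_steps) {
    std::vector<std::string> order;
    while (order.size() < max_steps) {
        auto &src = !q.main_queue.empty() ? q.main_queue : q.idle_queue;
        if (src.empty()) break;
        SketchTaskPtr task = src.front();
        src.pop_front();
        order.push_back(task->name);
        task->exec();
    }
    return order;
}
```

In this model, an idle Task that calls `idle()` on itself from inside its own body does not run again until everything on the main queue has been executed, which is the behaviour the commit message asks for.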
Author	SHA1	Message	Date
Zygo Blaxell	c53fa04a2f	task: fixes for priority and idle Tasks Tasks are not allowed to be queued more than once, but it is allowed to queue a Task while it's already running, which means a Task can be executed on two threads in parallel. Tasks detect this and handle it by queueing the Task on its own post-exec queue. That in turn leads to Workers which continually execute the same Task if that Task doesn't create any new Tasks, while other Tasks sit on the Master queue waiting for a Worker to dequeue them. For idle Tasks, we don't want the Task to be rescheduled immediately. We want the idle Task to execute again after every available Task on both the main and idle queues has been executed. Fix these by having each Task reschedule itself on the appropriate queue when it finishes executing. Priority queued Tasks should be executed in priority order across not just one Task's post-exec queue, but the entire local queue of the TaskConsumer. Fix this by moving the sort into either the TaskConsumer that receives a post-exec queue, if there is one, or into the Task that is created to insert the post-exec queue into a TaskConsumer when one becomes available in the future. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-15 00:43:25 -05:00
Zygo Blaxell	d4a681c8a2	Revert "roots: use a non-idle task for next_transid" next_transid tasks don't respect queue selection very well, because they effectively end up spinning in a loop until all other worker threads become busy. Back this out, and fix the priority handling in the Task library. This reverts commit `58db4071de`.	2025-01-12 18:48:33 -05:00
Zygo Blaxell	a819d623f7	task: do not allow queue loops in priority queueing mode Tasks using non-priority FIFO dependency tracking can insert themselves into their own queue, to run the Task again immediately after it exits. For priority queues, this attempts to splice the post-exec queue into itself, which doesn't seem like a good idea. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-12 15:28:26 -05:00
Zygo Blaxell	de9d72da80	task: flatten queues of dependent Tasks Suppose Task A, B, and C are created in that order, and currently running. Task T acquires Exclusion E. Task B, A, and C attempt to acquire the same Exclusion, in that order, but fail because Task T holds it. The result is Task T with a post-exec queue: T, [ B, A, C ] sort_requested Now suppose Task U acquires Exclusion F, then Task T attempts to acquire Exclusion F. Task T fails to acquire F, so T is inserted into U's post-exec queue. The result at the end of the execution of T is a tree: U, [ T ] sort_requested \-> [ B, A, C ] sort_requested Task T exits after failing to acquire a lock. When T exits, T will sort its post-exec queue and submit the post-exec queue for execution immediately: Worker 1: U, [ T ] sort_requested Worker 2: A, B, C This isn't ideal because T, A, B, and C all depend on at least one common Exclusion, so they are likely to immediately conflict with T when U exits and T runs again. Ideally, A, B, and C would at least remain in a common queue with T, and ideally that queue is sorted. Instead of inserting T into U's post-exec queue, insert T and all of T's post-exec queue, which creates a single flattened Task list: U, [ T, B, A, C ] sort_requested Then when U exits, it will sort [ T, B, A, C ] into [ A, B, C, T ], and run all of the queued Tasks in age priority order: U exited, [ T, B, A, C ] sort_requested U exited, [ A, B, C, T ] [ A, B, C, T ] on TaskConsumer queue Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-12 14:05:44 -05:00
Zygo Blaxell	74d8bdd60f	task: add an `insert` method for priority-queueing Tasks by age Task started out as a self-organizing parallel-make algorithm, but ended up becoming a half-broken wait-die algorithm. When a contended object is already locked, Tasks enter a FIFO queue to restart and acquire the lock. This is the "die" part of wait-die (all locks on an Exclusion are non-blocking, so no Task ever does "wait"). The lock queue is FIFO wrt _lock acquisition order_, not _Task age_ as required by the wait-die algorithm. Make it a 25%-broken wait-die algorithm by sorting the Tasks on lock queues in order of Task ID, i.e. oldest-first, or FIFO wrt Task age. This ensures the oldest Task waiting for an object is the one to get it when it becomes available, as expected from the wait-die algorithm. This should reduce the amount of time Tasks spend on the execution queue, and reduce memory usage by avoiding the accumulation of Tasks that cannot make forward progress. Note that turning `TaskQueue` into an ordered container would have undesirable side-effects: * `std::list` has some useful properties wrt stability of object location and cost of splicing. Other containers may not have these, and `std::list` does have a `sort` method. * Some Task objects are created at the beginning and reused continually, but we really do want those Tasks to be executed in FIFO order wrt submission, not Task ID. We can exclude these tasks by only doing the sorting when a Task is queued for an Exclusion object. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-12 00:35:37 -05:00
Zygo Blaxell	a5d078d48b	docs: deprecate the `--workaround-btrfs-send` option Emphasize that the option is relevant to old kernels, older than the minimum supportable version threshold. De-emphasize the use case of "send-workaround" as a synonym for "exclude read-only". Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:56 -05:00
Zygo Blaxell	e2587cae9b	docs: expand "Threads and load management" to suggest not running bees so much One of the more obvious ways to reduce bees load is to simply not run it all the time. Explicitly state using maintenance windows as a load management option. SIGUSR1 and SIGUSR2 should have been documented somewhere else before now. Better late than never. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:56 -05:00
Zygo Blaxell	ac581273d3	docs: config.md updates The theories behind bees slowing down when presented with a larger hash table turned out to be wrong. The real cause was a very old bug which submitted thousands of `LOGICAL_INO` requests when only a handful of requests were needed. "Compression on the filesystem" -> "Compression in files" Don't be so "dramatic". Be "rapid" instead. Remove "cannot avoid modifying read-only snapshots" as a distinction between subvol and extent scans. Both modes support send workaround and send waiting with no significant distinction. Emphasize extent scan's better handling of many snapshots. Also reflinks. Add some discussion of `--throttle-factor`. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:56 -05:00
Zygo Blaxell	7fcde97b70	docs: update the bug reporting and status instructions Thread names have changed. Document some of the newer ones. Don't jump immediately to blaming poor performance on qgroups or autodefrag. These do sometimes have kernel regressions but not all the time. Emphasize advantage of controlling bees deferred work requests at the source, before btrfs gets stuck committing them. Avoid asserting that it's OK for gdb to crash. Remove mention of lower-layer block device issues wrt corruption. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:55 -05:00
Zygo Blaxell	e457f502b7	docs: update kernel bugs page for January 2025 "Kernel" -> "Linux kernel". If you can run bees on a kernel that isn't Linux, congratulations! Emphasize the age of the data corruption warnings. Once 5.4 reaches EOL we can remove those. Simplify the discussion of old kernels and API levels. There's a new optional kernel API for `openat2` support at 5.6. The absolute minimum kernel version is still 4.2, and will not increase to 4.15 until the subvol scanners are removed. Remove discussion of bees support for kernels 4.19 (which recently reached EOL) and earlier. The `LOGICAL_INO` vs dedupe bug is actually a `LOGICAL_INO` vs clone bug. Dedupe isn't necessary to reproduce it. Remove a stray ')'. Strip out most of the discussion of slow backrefs, as they are no longer a concern on the range of supported kernel versions. Leave some description there because bees still has some vestigial workarounds. Remove `btrfs send` from the "Unfixed kernel bugs" section, which makes the section empty, so remove the section too. bees now handles send on a subvol reasonably well. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:55 -05:00
Zygo Blaxell	46815f1a9d	docs: update README.md Emphasize "large" is an upper bound on the size of filesystem bees can handle. New strengths: largest extent first for fixed maintenance windows, scans data only once (ish), recovers more space Removed weaknesses: less temporary space Need more caps than `CAP_SYS_ADMIN`. Emphasize DATA CORRUPTION WARNING is an old-kernel thing. Update copyright year. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:55 -05:00
Zygo Blaxell	0d251d30f4	docs: update feature interaction lists Tested on larger filesystems than 100T too, but let's use Fermi approximation. Next size is 1P. Removed interaction with block-level SSD caching subsystems. These are really btrfs metadata vs. a lower block layer, and have nothing to do with bees. Added mixed block groups to the tested list, as mixed block groups required explicit support in the extent scanner. Added btrfs-convert to the tested list. btrfs-convert has various problems with space allocation in general, but these can be solved by carefully ordered balances after conversion, and they have nothing to do with bees. In-kernel dedupe is dead and the stubs were removed years ago. Remove it from the list. btrfs send now plays nicely with bees on all supportable kernels, now that stable/linux-4.19.y is dead. Send workaround is only needed for kernels before v5.4 (technically v5.2, but nobody should ever mount a btrfs with kernel v5.1 to v5.3). bees will pause automatically when deduping a subvol that is currently running a send. bees will no longer gratuitously refragment data that was defragmented by autodefrag. Explicitly list all the RAID profiles tested so far, as there have been some new ones. Explicitly list other deduplicators tested. Sort the list of btrfs features alphabetically. Add scrub and balance, which have been tested with bees since the beginning. New tested btrfs features: block-group-tree, raid1c3, raid1c4. New untested btrfs features: squotas, raid-stripe-tree. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:55 -05:00
Zygo Blaxell	b8dd9a2db0	progress: put a timestamp in the bottom row This records the time when the progress data was calculated, to help indicate when the data might be very old. While we're here, move "now" out of the loop so there's only one value. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:55 -05:00
Zygo Blaxell	8bc90b743b	task: get rid of the `insert_task` method Nothing calls it (not even tests), and there's significant functional overlap with `try_lock`. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:55 -05:00
Zygo Blaxell	2f2a68be3d	roots: use openat2 instead of openat when available This increases resistance to symlink and mount attacks. Previously, bees could follow a symlink or a mount point in a directory component of a subvol or file name. Once the file is opened, the open file descriptor would be checked to see if its subvol and inode matches the expected file in the target filesystem. Files that fail to match would be immediately closed. With openat2 resolve flags, symlinks and mount points terminate path resolution in the kernel. Paths that lead through symlinks or onto mount points cannot be opened at all. Fall back to openat() if openat2() returns ENOSYS, so bees will still run on kernels before v5.6. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-09 02:26:53 -05:00
Zygo Blaxell	82f1fd8054	process: replace crucible::gettid() with a weak symbol Since we're now using weak symbols for dodgy libc functions, we might as well do it for gettid() too. Use the ::gettid() global namespace and let libc override it. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-09 01:37:44 -05:00
Zygo Blaxell	a9b07d7684	openat2: create a weak syscall wrapper for it openat2 allows closing more TOCTOU holes, but we can only use it when the kernel supports it. This should disappear seamlessly when libc implements the function. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-09 01:36:39 -05:00
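
The queue flattening and age-ordered sort described in commits `de9d72da80` and `74d8bdd60f` can be sketched with `std::list::splice` and `std::list::sort`. This is an illustrative model under stated assumptions, not the crucible code: `SketchNode`, `insert_flattened`, `release`, and `names` are hypothetical, and the model has no locking and no worker threads. It reproduces the example from the commit message, where `U, [ T, B, A, C ]` becomes `[ A, B, C, T ]` once U exits.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Task; a lower id means an older Task.
struct SketchNode {
    std::uint64_t id;
    std::string name;
    std::list<std::shared_ptr<SketchNode>> post_exec;  // Tasks waiting on this one
    bool sort_requested = false;

    SketchNode(std::uint64_t i, std::string n) : id(i), name(std::move(n)) {}
};
using SketchNodePtr = std::shared_ptr<SketchNode>;

// Queue `task` behind `owner`, taking along everything already waiting
// on `task`, so that all dependents end up in one flat list.
void insert_flattened(SketchNode &owner, const SketchNodePtr &task) {
    assert(owner.id != task->id);  // a Task cannot wait on itself
    owner.sort_requested = true;
    owner.post_exec.push_back(task);
    owner.post_exec.splice(owner.post_exec.end(), task->post_exec);
}

// When `owner` exits, move its waiters to the front of the local queue and,
// if a sort was requested, sort the entire local queue oldest-first.
void release(SketchNode &owner, std::list<SketchNodePtr> &local_queue) {
    local_queue.splice(local_queue.begin(), owner.post_exec);
    if (owner.sort_requested) {
        local_queue.sort([](const SketchNodePtr &a, const SketchNodePtr &b) {
            return a->id < b->id;
        });
        owner.sort_requested = false;
    }
}

// Helper to read back the queue order by name.
std::vector<std::string> names(const std::list<SketchNodePtr> &queue) {
    std::vector<std::string> rv;
    for (const auto &node : queue) rv.push_back(node->name);
    return rv;
}
```

Both `splice` and `sort` on `std::list` relink nodes rather than copying elements, which matches the commit message's stated reason for keeping `TaskQueue` a `std::list` instead of switching to an ordered container.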