mirror of https://github.com/Zygo/bees.git synced 2025-08-02 22:03:29 +02:00

17 Commits

Author SHA1 Message Date
Zygo Blaxell
124507232f docs: add vmalloc bug to kernel bugs list
The bug is:

	v6.3-rc6: f349b15e183d mm: vmalloc: avoid warn_alloc noise caused by fatal signal

The fixes are:

	v6.4: 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
	v6.3.10: c189994b5dd3 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails

The bug has been backported to LTS, but the fix has not:

	v6.2.11: 61334bc29781 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
	v6.1.24: ef6bd8f64ce0 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
	v5.15.107: a184df0de132 mm: vmalloc: avoid warn_alloc noise caused by fatal signal

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 13:50:12 -04:00
Zygo Blaxell
3c5e13c885 context: log when LOGICAL_INO returns 0 refs
There was a bug in kernel 6.3 where LOGICAL_INO with IGNORE_OFFSET
sometimes fails to ignore the offset.  That bug is now fixed, but
LOGICAL_INO still returns 0 refs much more often than seems appropriate.

This is most likely because bees frequently deletes extents while there
is still work waiting for them in Task queues.  In this case, LOGICAL_INO
correctly returns an empty list, because every reference to some extent
is deleted, but the new extent tree with that extent removed is not yet
committed in btrfs.

Add a DEBUG-level log message and an event counter to track these events.
In the absence of a kernel bug, the debug message may indicate CPU time
was wasted performing a search whose outcome could have been predicted.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 12:54:33 -04:00
Zygo Blaxell
a6ca2fa2f6 docs: add IGNORE_OFFSET regression in 6.2..6.3 to kernel bugs list
This doesn't impact the current bees master, but it does break bees-next.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 12:49:36 -04:00
Zygo Blaxell
3f23a0c73f context: downgrade toxic extent workaround message
Toxic extents are much less of a problem now than they were in kernels
before 5.7.  Downgrade the log message level to reflect their lesser
importance.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 12:49:36 -04:00
Zygo Blaxell
d6732c58e2 test: GCC 13 fix for limits.cc
GCC complains that #include <cstdint> is missing, so add that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-05-07 21:24:21 -04:00
Zygo Blaxell
75b2067cef btrfs-tree: fix build on clang++16
The "loops" variable isn't read (only set) if not built with extra
debug code.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-05-07 21:23:27 -04:00
Zygo Blaxell
da3ef216b1 docs: working around btrfs send issues isn't really a feature
The critical kernel bugs in send have been fixed for years.
The limitations that remain aren't bugs, and bees has no sustainable
workaround for them.

Also update copyright year range.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-03-07 10:25:51 -05:00
Zygo Blaxell
b7665d49d9 docs: fill in missing LTS backports for "1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-03-07 10:17:44 -05:00
Zygo Blaxell
717bdf5eb5 roots: make sure transid_max's computed value isn't max
We check the result of transid_max_nocache(), but not the result of
transid_max().  The latter is a computed result that is even more likely
to be wrong[citation needed].

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:45:29 -05:00
Zygo Blaxell
9b60f2b94d docs: add "missing" features that have been in development for some time already
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:42:42 -05:00
Zygo Blaxell
8978d63e75 docs: update GCC versions list and clarify markdown statement
I don't know if anyone else is testing GCC versions before 8.0 any more,
but I'm not.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:39:55 -05:00
Zygo Blaxell
82474b4ef4 docs: update front page
At least one user was significantly confused by "designed for large
filesystems".

The btrfs send workarounds aren't new any more.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:38:50 -05:00
Zygo Blaxell
73834beb5a docs: minor changes to how-it-works based on past user questions
Clarify that "too large" and "too small" are some distance away from each other.
The Goldilocks zone is _wide_.

The interval between cache drops is now shorter.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:37:37 -05:00
Zygo Blaxell
c92ba117d8 docs: various gotcha updates
Fixing the obviously wrong and out of date stuff.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:37:23 -05:00
Zygo Blaxell
c354e77634 docs: simplify the exit-with-SIGTERM description
The description now matches the code again.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:36:44 -05:00
Zygo Blaxell
f21569e88c docs: update the feature interactions page
Fixing the obviously out-of-date and no-longer-tested things.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:34:22 -05:00
Zygo Blaxell
3d5ebe4d40 docs: update kernel bugs and workarounds list for 6.2.0
Remove some of the repetition to make the document easier to edit.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:32:52 -05:00
14 changed files with 148 additions and 147 deletions

@@ -17,7 +17,6 @@ Strengths
* Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon incrementally dedupes new data using btrfs tree search
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* **NEW** [Works around `btrfs send` problems with dedupe and incremental parent snapshots](docs/options.md)
* Works around btrfs filesystem structure to free more disk space
* Persistent hash table for rapid restart after shutdown
* Whole-filesystem dedupe - including snapshots
@@ -70,6 +69,6 @@ You can also use Github:
Copyright & License
-------------------
Copyright 2015-2022 Zygo Blaxell <bees@furryterror.org>.
Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).

@@ -7,23 +7,24 @@ First, a warning that is not specific to bees:
severe regression that can lead to fatal metadata corruption.**
This issue is fixed in kernel 5.4.14 and later.
**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, or 5.12,
with recent LTS and -stable updates.** The latest released kernel as
of this writing is 5.18.18.
**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
6.0, or 6.1, with recent LTS and -stable updates.** The latest released
kernel as of this writing is 6.4.1.
4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with
some issues. Older kernels will be slower (a little slower or a lot
slower depending on which issues are triggered). Not all fixes are
backported.
4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
issues. Older kernels will be slower (a little slower or a lot slower
depending on which issues are triggered). Not all fixes are backported.
Obsolete non-LTS kernels have a variety of unfixed issues and should
not be used with btrfs. For details see the table below.
bees requires btrfs kernel API version 4.2 or higher, and does not work
on older kernels.
at all on older kernels.
bees will detect and use btrfs kernel API up to version 4.15 if present.
In some future bees release, this API version may become mandatory.
Some bees features rely on kernel 4.15 to work, and these features will
not be available on older kernels. Currently, bees is still usable on
older kernels with degraded performance or with options disabled, but
support for older kernels may be removed.
@@ -58,14 +59,17 @@ These bugs are particularly popular among bees users, though not all are specifi
| - | 5.8 | deadlock in `TREE_SEARCH` ioctl (core component of bees filesystem scanner), followed by regression in deadlock fix | 4.4.237, 4.9.237, 4.14.199, 4.19.146, 5.4.66, 5.8.10 and later | a48b73eca4ce btrfs: fix potential deadlock in the search ioctl, 1c78544eaa46 btrfs: fix wrong address when faulting in pages in the search ioctl
| 5.7 | 5.10 | kernel crash if balance receives fatal signal e.g. Ctrl-C | 5.4.93, 5.10.11, 5.11 and later | 18d3bff411c8 btrfs: don't get an EINTR during drop_snapshot for reloc
| 5.10 | 5.10 | 20x write performance regression | 5.10.8, 5.11 and later | e076ab2a2ca7 btrfs: shrink delalloc pages instead of full inodes
| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
| - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
| - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
| 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
| - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
| - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
| 5.18 | 5.19 | parent transid verify failed during log tree replay after a crash during a rename operation | 5.18.18, 5.19.2, 6.0 and later | 723df2bcc9e1 btrfs: join running log transaction when logging new name
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl | - | workaround: reduce bees thread count to 1 with `-c1`
| 5.12 | 6.0 | space cache corruption and potential double allocations | 5.15.65, 5.19.6, 6.0 and later | ced8ecf026fd btrfs: fix space cache corruption and potential double allocations
| 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later. Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
| 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that
"Last bad kernel" refers to that version's last stable update from
kernel.org. Distro kernels may backport additional fixes. Consult
@@ -80,21 +84,45 @@ through 5.4.13 inclusive.
A "-" for "first bad kernel" indicates the bug has been present since
the relevant feature first appeared in btrfs.
A "-" for "last bad kernel" indicates the bug has not yet been fixed as
of 5.18.18.
A "-" for "last bad kernel" indicates the bug has not yet been fixed in
current kernels (see top of this page for which kernel version that is).
In cases where issues are fixed by commits spread out over multiple
kernel versions, "fixed kernel version" refers to the version that
contains all components of the fix.
contains the last committed component of the fix.
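
As a rough illustration of how one row of the table can be read, the sketch below encodes only the vmalloc row (bug backported to 5.15.107, 6.1.24 and 6.2.11; fixed in 6.3.10 and 6.4). The function names and data layout are hypothetical, and the numbers are a snapshot copied from the table, so later LTS updates may well carry the fix.

```python
# Hypothetical helper, not part of bees: the version data below is copied
# from the vmalloc row of the kernel bug table and is only a snapshot.

def parse_version(text):
    """Turn '6.1.24' or '6.4' into a (major, minor, patch) tuple of ints."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])

# (major, minor) series -> first stable update that carries the bug backport
BUG_BACKPORTS = {(5, 15): 107, (6, 1): 24, (6, 2): 11}

# First 6.3 stable update that carries the fix
FIX_IN_6_3 = 10

def has_vmalloc_warning_bug(text):
    """Return True if the table lists this kernel version as affected."""
    major, minor, patch = parse_version(text)
    series = (major, minor)
    if series in BUG_BACKPORTS:
        return patch >= BUG_BACKPORTS[series]
    if series == (6, 3):
        return patch < FIX_IN_6_3
    return False  # older series never got the bug; 6.4 and later have the fix

print(has_vmalloc_warning_bug("6.1.30"))
print(has_vmalloc_warning_bug("6.4.1"))
```

The same shape (a per-series threshold plus a "fixed in" version) would fit most other rows, though each row needs its own data.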
Workarounds for known kernel bugs
---------------------------------
* **Hangs with high worker thread counts**: On kernels newer than
5.4, multiple threads running `LOGICAL_INO` and dedupe ioctls
at the same time can lead to a kernel hang. The workaround is
to reduce the thread count to 1 with `-c1`.
* **Hangs with concurrent `LOGICAL_INO` and dedupe**: on all
kernel versions so far, multiple threads running `LOGICAL_INO`
and dedupe ioctls at the same time on the same inodes or extents
can lead to a kernel hang. The kernel enters an infinite loop in
`add_all_parents`, where `count` is 0, `ref->count` is 1, and
`btrfs_next_item` or `btrfs_next_old_item` never finds a matching ref.
bees has two workarounds for this bug: 1. schedule work so that multiple
threads do not simultaneously access the same inode or the same extent,
and 2. use a brute-force global lock within bees that prevents any
thread from running `LOGICAL_INO` while any other thread is running
dedupe.
Workaround #1 isn't really a workaround, since we want to do the same
thing for unrelated performance reasons. If multiple threads try to
perform dedupe operations on the same extent or inode, btrfs will make
all the threads wait for the same locks anyway, so it's better to have
bees find some other inode or extent to work on while waiting for btrfs
to finish.
Workaround #2 doesn't seem to be needed after implementing workaround
#1, but it's better to be slightly slower than to hang one CPU core
and the filesystem until the kernel is rebooted.
It is still theoretically possible to trigger the kernel bug when
running bees at the same time as other dedupers, or other programs
that use `LOGICAL_INO` like `btdu`; however, it's extremely difficult
to reproduce the bug without closely cooperating threads.
* **Slow backrefs** (aka toxic extents): Under certain conditions,
if the number of references to a single shared extent grows too
@@ -110,8 +138,8 @@ Workarounds for known kernel bugs
at this time of writing only bees has a workaround for this bug.
This workaround is less necessary for kernels 5.4.96, 5.7 and later,
though it can still take 2 ms of CPU to resolve each extent ref on a
fast machine on a large, heavily fragmented file.
though the bees workaround can still be triggered on newer kernels
by changes in btrfs since kernel version 5.1.
* **dedupe breaks `btrfs send` in old kernels**. The bees option
`--workaround-btrfs-send` prevents any modification of read-only subvols
@@ -127,8 +155,6 @@ Workarounds for known kernel bugs
Unfixed kernel bugs
-------------------
As of 5.18.18:
* **The kernel does not permit `btrfs send` and dedupe to run at the
same time**. Recent kernels no longer crash, but now refuse one
operation with an error if the other operation was already running.

@@ -8,44 +8,35 @@ bees has been tested in combination with the following:
* HOLE extents and btrfs no-holes feature
* Other deduplicators, reflink copies (though bees may decide to redo their work)
* btrfs snapshots and non-snapshot subvols (RW and RO)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, build daemons)
* all btrfs RAID profiles
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
* All btrfs RAID profiles
* IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
* Filesystems mounted *with* the flushoncommit option ([lots of harmless kernel log warnings on 4.15 and later](btrfs-kernel.md))
* Filesystems mounted *without* the flushoncommit option
* Filesystems mounted with or without the `flushoncommit` option
* 4K filesystem data block size / clone alignment
* 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
* Huge files (>1TB--although Btrfs performance on such files isn't great in general)
* filesystems up to 30T+ bytes, 100M+ files
* Large files (kernel 5.4 or later strongly recommended)
* Filesystems up to 90T+ bytes, 1000M+ files
* btrfs receive
* btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
* open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)
* lvmcache: no problems observed in testing with recent kernels or reported by users in the last year.
* lvm dm-cache, writecache
Bad Btrfs Feature Interactions
------------------------------
bees has been tested in combination with the following, and various problems are known:
* bcache: no data-losing problems observed in testing with recent kernels
or reported by users in the last year. Some issues observed with
bcache interacting badly with some SSD models' firmware, but so far
this only causes temporary loss of service, not filesystem damage.
This behavior does not seem to be specific to bees (ordinary filesystem
tests with rsync and snapshots will reproduce it), but it does prevent
any significant testing of bees on bcache.
* btrfs send: there are bugs in `btrfs send` that can be triggered by bees.
The [`--workaround-btrfs-send` option](options.md) works around this issue
by preventing bees from modifying read-only snapshots.
* btrfs send: there are bugs in `btrfs send` that can be triggered by
bees on old kernels. The [`--workaround-btrfs-send` option](options.md)
works around this issue by preventing bees from modifying read-only
snapshots.
* btrfs qgroups: very slow, sometimes hangs...and it's even worse when
bees is running.
* btrfs autodefrag mount option: hangs and high CPU usage problems
reported by users. bees cannot distinguish autodefrag activity from
normal filesystem activity and will likely try to undo the autodefrag
if duplicate copies of the defragmented data exist.
* btrfs autodefrag mount option: bees cannot distinguish autodefrag
activity from normal filesystem activity, and may try to undo the
autodefrag if duplicate copies of the defragmented data exist.
Untested Btrfs Feature Interactions
-----------------------------------
@@ -54,9 +45,10 @@ bees has not been tested with the following, and undesirable interactions may oc
* Non-4K filesystem data block size (should work if recompiled)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
* btrfs seed filesystems (does anyone even use those?)
* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe or encryption)
* btrfs seed filesystems (no particular reason it wouldn't work, but no one has reported trying)
* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe, encryption, extent tree v2)
* btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper.
* Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
* bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper

@@ -8,9 +8,10 @@ are reasonable in most cases.
Hash Table Sizing
-----------------
Hash table entries are 16 bytes per data block. The hash table stores
the most recently read unique hashes. Once the hash table is full,
each new entry in the table evicts an old entry.
Hash table entries are 16 bytes per data block. The hash table stores the
most recently read unique hashes. Once the hash table is full, each new
entry added to the table evicts an old entry. This makes the hash table
a sliding window over the most recently scanned data from the filesystem.
Here are some numbers to estimate appropriate hash table sizes:
@@ -25,9 +26,11 @@ Here are some numbers to estimate appropriate hash table sizes:
Notes:
* If the hash table is too large, no extra dedupe efficiency is
obtained, and the extra space just wastes RAM. Extra space can also slow
bees down by preventing old data from being evicted, so bees wastes time
looking for matching data that is no longer present on the filesystem.
obtained, and the extra space wastes RAM. If the hash table contains
more block records than there are blocks in the filesystem, the extra
space can slow bees down. A table that is too large prevents obsolete
data from being evicted, so bees wastes time looking for matching data
that is no longer present on the filesystem.
* If the hash table is too small, bees extrapolates from matching
blocks to find matching adjacent blocks in the filesystem that have been
@@ -36,6 +39,10 @@ one block in common between two extents in order to be able to dedupe
the entire extents. This provides significantly more dedupe hit rate
per hash table byte than other dedupe tools.
* There is a fairly wide range of usable hash sizes, and performance
degrades according to a smooth probabilistic curve in both directions.
Double or half the optimum size usually works just as well.
* When counting unique data in compressed data blocks to estimate
optimum hash table size, count the *uncompressed* size of the data.
@@ -66,11 +73,11 @@ data on an uncompressed filesystem. Dedupe efficiency falls dramatically
with hash tables smaller than 128MB/TB as the average dedupe extent size
is larger than the largest possible compressed extent size (128KB).
* **Short writes** also shorten the average extent length and increase
optimum hash table size. If a database writes to files randomly using
4K page writes, all of these extents will be 4K in length, and the hash
table size must be increased to retain each one (or the user must accept
a lower dedupe hit rate).
* **Short writes or fragmentation** also shorten the average extent
length and increase optimum hash table size. If a database writes to
files randomly using 4K page writes, all of these extents will be 4K
in length, and the hash table size must be increased to retain each one
(or the user must accept a lower dedupe hit rate).
Defragmenting files that have had many short writes increases the
extent length and therefore reduces the optimum hash table size.
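
The sizing arithmetic above can be sketched in a few lines. This assumes the 16-byte entry size stated in this section and a 4K data block size; the function names are made up for illustration. It computes how much data the table covers when every block's hash is retained, which is the worst case (bees needs far less in practice because it extrapolates from one matching block to its neighbours).

```python
ENTRY_BYTES = 16     # hash table entry size, from the text above
BLOCK_BYTES = 4096   # assumption: 4K filesystem data block size

GiB = 1024 ** 3

def table_entries(table_bytes):
    """Number of block hashes that fit in a table of the given size."""
    return table_bytes // ENTRY_BYTES

def window_bytes(table_bytes):
    """Data covered by the table if every block's hash is kept (worst case)."""
    return table_entries(table_bytes) * BLOCK_BYTES

print(table_entries(1 * GiB))        # 67108864 entries
print(window_bytes(1 * GiB) // GiB)  # 256 (GiB of data per GiB of table)
```

The ratio is simply `BLOCK_BYTES / ENTRY_BYTES` = 256, so a workload made entirely of 4K extents needs roughly 4 GiB of table per TiB of unique data to retain every extent.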

@@ -296,6 +296,7 @@ resolve
The `resolve` event group consists of operations related to translating a btrfs virtual block address (i.e. physical block address) to a `(root, inode, offset)` tuple (i.e. locating and opening the file containing a matching block). `resolve` is the top level, `chase` and `adjust` are the lower two levels.
* `resolve_empty`: The `LOGICAL_INO` ioctl returned successfully with an empty reference list (0 items).
* `resolve_fail`: The `LOGICAL_INO` ioctl returned an error.
* `resolve_large`: The `LOGICAL_INO` ioctl returned more than 2730 results (the limit of the v1 ioctl).
* `resolve_ms`: Total time spent in the `LOGICAL_INO` ioctl (i.e. wallclock time, not kernel CPU time).
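
The figure of 2730 in `resolve_large` can be reproduced with a short calculation. The layout used here is an assumption taken from the btrfs UAPI headers rather than from the text above: a 64 KiB result buffer for the v1 ioctl, a header of four 32-bit counters, and three 64-bit values (inode, offset, root) per reference.

```python
import struct

V1_BUFFER_BYTES = 64 * 1024              # assumed v1 LOGICAL_INO buffer limit
HEADER_BYTES = struct.calcsize("=4I")    # assumed header: four u32 counters
RECORD_BYTES = struct.calcsize("=3Q")    # assumed record: inode, offset, root

max_results = (V1_BUFFER_BYTES - HEADER_BYTES) // RECORD_BYTES
print(max_results)  # 2730
```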

@@ -51,81 +51,40 @@ loops early. The exception text in this case is:
Terminating bees with SIGTERM
-----------------------------
bees is designed to survive host crashes, so it is safe to terminate
bees using SIGKILL; however, when bees next starts up, it will repeat
some work that was performed between the last bees crawl state save point
and the SIGKILL (up to 15 minutes). If bees is stopped and started less
than once per day, then this is not a problem as the proportional impact
is quite small; however, users who stop and start bees daily or even
more often may prefer to have a clean shutdown with SIGTERM so bees can
restart faster.
bees is designed to survive host crashes, so it is safe to terminate bees
using SIGKILL; however, when bees next starts up, it will repeat some
work that was performed between the last bees crawl state save point
and the SIGKILL (up to 15 minutes), and a large hash table may not be
completely written back to disk, so some duplicate matches will be lost.
bees handling of SIGTERM can take a long time on machines with some or
all of:
If bees is stopped and started less than once per week, then this is not
a problem as the proportional impact is quite small; however, users who
stop and start bees daily or even more often may prefer to have a clean
shutdown with SIGTERM so bees can restart faster.
* Large RAM and `vm.dirty_ratio`
* Large number of active bees worker threads
* Large number of bees temporary files (proportional to thread count)
* Large hash table size
* Large filesystem size
* High IO latency, especially "low power" spinning disks
* High filesystem activity, especially duplicate data writes
The shutdown procedure performs these steps:
Each of these factors individually increases the total time required
to perform a clean bees shutdown. When combined, the factors can
multiply with each other, dramatically increasing the time required to
flush bees state to disk.
On a large system with many of the above factors present, a "clean"
bees shutdown can take more than 20 minutes. Even a small machine
(16GB RAM, 1GB hash table, 1TB NVME disk) can take several seconds to
complete a SIGTERM shutdown.
The shutdown procedure performs potentially long-running tasks in
this order:
1. Worker threads finish executing their current Task and exit.
Threads executing `LOGICAL_INO` ioctl calls usually finish quickly,
but btrfs imposes no limit on the ioctl's running time, so it
can take several minutes in rare bad cases. If there is a btrfs
commit already in progress on the filesystem, then most worker
threads will be blocked until the btrfs commit is finished.
2. Crawl state is saved to `$BEESHOME`. This normally completes
relatively quickly (a few seconds at most). This is the most
1. Crawl state is saved to `$BEESHOME`. This is the most
important bees state to save to disk as it directly impacts
restart time, so it is done as early as possible (but no earlier).
restart time, so it is done as early as possible.
3. Hash table is written to disk. Normally the hash table is
trickled back to disk at a rate of about 2GB per hour;
2. Hash table is written to disk. Normally the hash table is
trickled back to disk at a rate of about 128KiB per second;
however, SIGTERM causes bees to attempt to flush the whole table
immediately. If bees has recently been idle then the hash table is
likely already flushed to disk, so this step will finish quickly;
however, if bees has recently been active and the hash table is
large relative to RAM size, the blast of rapidly written data
can force the Linux VFS to block all writes to the filesystem
for sufficient time to complete all pending btrfs metadata
writes which accumulated during the btrfs commit before bees
received SIGTERM...and _then_ let bees write out the hash table.
The time spent here depends on the size of RAM, speed of disks,
and aggressiveness of competing filesystem workloads.
immediately. The time spent here depends on the size of RAM, speed
of disks, and aggressiveness of competing filesystem workloads.
It can trigger `vm.dirty_bytes` limits and block other processes
writing to the filesystem for a while.
4. bees temporary files are closed, which implies deletion of their
inodes. These are files which consist entirely of shared extent
structures, and btrfs takes an unusually long time to delete such
files (up to a few minutes for each on slow spinning disks).
3. The bees process calls `_exit`, which terminates all running
worker threads and closes and deletes all temporary files. This
can take a while _after_ the bees process exits, especially on
slow spinning disks.
If bees is terminated with SIGKILL, only step #1 and #4 are performed (the
kernel performs these automatically if bees exits). This reduces the
shutdown time at the cost of increased startup time.
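
A supervisor script that prefers the clean SIGTERM path might look like the sketch below. It is not part of bees: `stop_cleanly` is a hypothetical helper that sends SIGTERM to a child process and waits up to a caller-chosen timeout, leaving the decision about SIGKILL to the caller.

```python
import signal
import subprocess

def stop_cleanly(proc, timeout_s):
    """Send SIGTERM to a subprocess.Popen child and wait for it to exit.

    Returns the exit status, or None if the child is still running after
    timeout_s seconds (the caller decides whether to escalate to SIGKILL).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
```

Because the hash table flush can be slow on busy or slow disks, a generous timeout is the safer choice; falling back to SIGKILL is safe but costs repeated work on the next start.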
Balances
--------
First, read [`LOGICAL_INO` and btrfs balance WARNING](btrfs-kernel.md).
bees will suspend operations during a btrfs balance to work around
kernel bugs.
A btrfs balance relocates data on disk by making a new copy of the
data, replacing all references to the old data with references to the
new copy, and deleting the old copy. To bees, this is the same as any
@@ -175,7 +134,9 @@ the beginning.
Each time bees dedupes an extent that is referenced by a snapshot,
the entire metadata page in the snapshot subvol (16KB by default) must
be CoWed in btrfs. This can result in a substantial increase in btrfs
be CoWed in btrfs. Since all references must be removed at the same
time, this CoW operation is repeated in every snapshot containing the
duplicate data. This can result in a substantial increase in btrfs
metadata size if there are many snapshots on a filesystem.
Normally, metadata is small (less than 1% of the filesystem) and dedupe
@@ -252,17 +213,18 @@ Other Gotchas
filesystem while `LOGICAL_INO` is running. Generally the CPU spends
most of the runtime of the `LOGICAL_INO` ioctl running the kernel,
so on a single-core CPU the entire system can freeze up for a second
during operations on toxic extents.
during operations on toxic extents. Note this only occurs on older
kernels. See [the slow backrefs kernel bug section](btrfs-kernel.md).
* If a process holds a directory FD open, the subvol containing the
directory cannot be deleted (`btrfs sub del` will start the deletion
process, but it will not proceed past the first open directory FD).
`btrfs-cleaner` will simply skip over the directory *and all of its
children* until the FD is closed. bees avoids this gotcha by closing
all of the FDs in its directory FD cache every 10 btrfs transactions.
all of the FDs in its directory FD cache every btrfs transaction.
* If a file is deleted while bees is caching an open FD to the file,
bees continues to scan the file. For very large files (e.g. VM
images), the deletion of the file can be delayed indefinitely.
To limit this delay, bees closes all FDs in its file FD cache every
10 btrfs transactions.
btrfs transaction.

@@ -8,10 +8,12 @@ bees uses checkpoints for persistence to eliminate the IO overhead of a
transactional data store. On restart, bees will dedupe any data that
was added to the filesystem since the last checkpoint. Checkpoints
occur every 15 minutes for scan progress, stored in `beescrawl.dat`.
The hash table trickle-writes to disk at 4GB/hour to `beeshash.dat`.
An hourly performance report is written to `beesstats.txt`. There are
no special requirements for bees hash table storage--`.beeshome` could
be stored on a different btrfs filesystem, ext4, or even CIFS.
The hash table trickle-writes to disk at 128KiB/s to `beeshash.dat`,
but will flush immediately if bees is terminated by SIGTERM.
There are no special requirements for bees hash table storage--`.beeshome`
could be stored on a different btrfs filesystem, ext4, or even CIFS (but
not MS-DOS--beeshome does need filenames longer than 8.3).
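
To put the 128KiB/s trickle-write rate in perspective, the following sketch converts it to an hourly volume and estimates how long one full pass over a hash table takes. The function names are illustrative only, and the estimate assumes the rate is constant.

```python
TRICKLE_BYTES_PER_SEC = 128 * 1024   # trickle-write rate, from the text above

MiB = 1024 ** 2
GiB = 1024 ** 3

def trickle_bytes_per_hour():
    """Hash table bytes written back per hour at the trickle rate."""
    return TRICKLE_BYTES_PER_SEC * 3600

def hours_for_full_pass(table_bytes):
    """Hours needed to write the whole table once at the trickle rate."""
    return table_bytes / trickle_bytes_per_hour()

print(trickle_bytes_per_hour() // MiB)          # 450 (MiB per hour)
print(round(hours_for_full_pass(1 * GiB), 2))   # about 2.28 hours
```

This is why a SIGKILL can leave a large hash table only partly written back, while SIGTERM flushes it immediately.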
bees uses a persistent dedupe hash table with a fixed size configured
by the user. Any size of hash table can be dedicated to dedupe. If a
@@ -20,7 +22,7 @@ small as 128KB.
The bees hash table is loaded into RAM at startup and `mlock`ed so it
will not be swapped out by the kernel (if swap is permitted, performance
degrades to nearly zero).
degrades to nearly zero, for both bees and the swap device).
bees scans the filesystem in a single pass which removes duplicate
extents immediately after they are detected. There are no distinct
@@ -83,12 +85,12 @@ of these functions in userspace, at the expense of encountering [some
kernel bugs in `LOGICAL_INO` performance](btrfs-kernel.md).
bees uses only the data-safe `FILE_EXTENT_SAME` (aka `FIDEDUPERANGE`)
kernel operations to manipulate user data, so it can dedupe live data
(e.g. build servers, sqlite databases, VM disk images). It does not
modify file attributes or timestamps.
kernel ioctl to manipulate user data, so it can dedupe live data
(e.g. build servers, sqlite databases, VM disk images). bees does not
modify file attributes or timestamps in deduplicated files.
When bees has scanned all of the data, bees will pause until 10
transactions have been completed in the btrfs filesystem. bees tracks
When bees has scanned all of the data, bees will pause until a new
transaction has completed in the btrfs filesystem. bees tracks
the current btrfs transaction ID over time so that it polls less often
on quiescent filesystems and more often on busy filesystems.
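
The polling behaviour described above can be illustrated with a minimal sketch. This is not the bees implementation: the class `TransidPollEstimator`, its method names, and the delay bounds are all hypothetical, chosen only to show how a poll interval can follow the observed transaction rate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of bees): estimates how long to wait
// before polling the filesystem's transaction ID again.
class TransidPollEstimator {
public:
	TransidPollEstimator(double min_delay, double max_delay) :
		m_min_delay(min_delay),
		m_max_delay(max_delay),
		m_seconds_per_transid(min_delay)
	{
		assert(min_delay <= max_delay);
	}

	// Record that `transid` was observed at time `now` (in seconds).
	void update(double now, uint64_t transid) {
		if (!m_have_sample) {
			// First observation: nothing to compare against yet
			m_have_sample = true;
			m_last_time = now;
			m_last_transid = transid;
			return;
		}
		if (transid > m_last_transid && now > m_last_time) {
			// Busy filesystem: average time per transaction since the last change
			m_seconds_per_transid = (now - m_last_time) / double(transid - m_last_transid);
			m_last_time = now;
			m_last_transid = transid;
		} else {
			// Quiescent filesystem: back off as the idle period grows
			m_seconds_per_transid = std::max(m_seconds_per_transid, now - m_last_time);
		}
	}

	// Suggested delay before the next poll, clamped to [min_delay, max_delay].
	double poll_delay() const {
		return std::min(std::max(m_seconds_per_transid, m_min_delay), m_max_delay);
	}

private:
	double m_min_delay;
	double m_max_delay;
	double m_seconds_per_transid;
	bool m_have_sample = false;
	double m_last_time = 0;
	uint64_t m_last_transid = 0;
};
```

With bounds of 1 and 60 seconds, a filesystem that commits five transactions in ten seconds yields a 2-second delay, while one that stays idle for several minutes is polled at the 60-second maximum.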

@@ -17,7 +17,6 @@ Strengths
* Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon incrementally dedupes new data using btrfs tree search
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* **NEW** [Works around `btrfs send` problems with dedupe and incremental parent snapshots](options.md)
* Works around btrfs filesystem structure to free more disk space
* Persistent hash table for rapid restart after shutdown
* Whole-filesystem dedupe - including snapshots
@@ -70,6 +69,6 @@ You can also use Github:
Copyright & License
-------------------
Copyright 2015-2022 Zygo Blaxell <bees@furryterror.org>.
Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).

@@ -4,7 +4,7 @@ Building bees
Dependencies
------------
* C++11 compiler (tested with GCC 4.9, 6.3.0, 8.1.0)
* C++11 compiler (tested with GCC 8.1.0, 12.2.0)
Sorry. I really like closures and shared_ptr, so support
for earlier compiler versions is unlikely.
@@ -19,7 +19,7 @@ Dependencies
* [Linux kernel version](btrfs-kernel.md) gets its own page.
* markdown for documentation
* markdown to build the documentation
* util-linux version that provides `blkid` command for the helper
script `scripts/beesd` to work

@@ -2,8 +2,8 @@ Features You Might Expect That bees Doesn't Have
------------------------------------------------
* There's no configuration file (patches welcome!). There are
some tunables hardcoded in the source that could eventually become
configuration options. There's also an incomplete option parser
some tunables hardcoded in the source (`src/bees.h`) that could eventually
become configuration options. There's also an incomplete option parser
(patches welcome!).
* The bees process doesn't fork and writes its log to stdout/stderr.
@@ -43,3 +43,6 @@ compression method or not compress the data (patches welcome!).
* It is theoretically possible to resize the hash table without starting
over with a new full-filesystem scan; however, this feature has not been
implemented yet.
* btrfs maintains csums of data blocks which bees could use to improve
scan speeds, but bees doesn't use them yet.

@@ -548,7 +548,7 @@ namespace crucible {
#endif
const uint64_t logical_end = logical + count * block_size();
BtrfsTreeItem bti = rlower_bound(logical);
size_t loops = 0;
size_t __attribute__((unused)) loops = 0;
BCTFGS_DEBUG("get_sums " << to_hex(logical) << ".." << to_hex(logical_end) << endl);
while (!!bti) {
BCTFGS_DEBUG("get_sums[" << loops << "]: " << bti << endl);

@@ -821,6 +821,10 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
// Avoid performance problems - pretend resolve failed if there are too many refs
const size_t rv_count = log_ino.m_iors.size();
if (!rv_count) {
BEESLOGDEBUG("LOGICAL_INO returned 0 refs at " << to_hex(addr));
BEESCOUNT(resolve_empty);
}
if (rv_count < BEES_MAX_EXTENT_REF_COUNT) {
rv.m_biors = vector<BtrfsInodeOffsetRoot>(log_ino.m_iors.begin(), log_ino.m_iors.end());
} else {
@@ -832,7 +836,7 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
if (sys_usage_delta < BEES_TOXIC_SYS_DURATION) {
rv.m_is_toxic = false;
} else {
BEESLOGNOTICE("WORKAROUND: toxic address: addr = " << addr << ", sys_usage_delta = " << round(sys_usage_delta* 1000.0) / 1000.0 << ", user_usage_delta = " << round(user_usage_delta * 1000.0) / 1000.0 << ", rt_age = " << rt_age << ", refs " << rv_count);
BEESLOGDEBUG("WORKAROUND: toxic address: addr = " << addr << ", sys_usage_delta = " << round(sys_usage_delta* 1000.0) / 1000.0 << ", user_usage_delta = " << round(user_usage_delta * 1000.0) / 1000.0 << ", rt_age = " << rt_age << ", refs " << rv_count);
BEESCOUNT(resolve_toxic);
rv.m_is_toxic = true;
}

@@ -515,7 +515,12 @@ BeesRoots::transid_max_nocache()
uint64_t
BeesRoots::transid_max()
{
return m_transid_re.count();
const auto rv = m_transid_re.count();
// transid must be greater than zero, or we did something very wrong
THROW_CHECK1(runtime_error, rv, rv > 0);
// transid must be less than max, or we did something very wrong
THROW_CHECK1(runtime_error, rv, rv < numeric_limits<uint64_t>::max());
return rv;
}
struct BeesFileCrawl {

@@ -3,6 +3,7 @@
#include "crucible/limits.h"
#include <cassert>
#include <cstdint>
using namespace crucible;