mirror of
https://github.com/Zygo/bees.git
synced 2025-08-01 13:23:28 +02:00
Compare commits
75 Commits
v0.11-rc2
...
ee5c971d77
| Author | SHA1 | Date | |
|---|---|---|---|
| | ee5c971d77 | | |
| | d37f916507 | | |
| | 3a17a4dcdd | | |
| | 4039ef229e | | |
| | e9d4aa4586 | | |
| | 504f4cda80 | | |
| | 6c36f4973f | | |
| | b1bd99c077 | | |
| | d5e805ab8d | | |
| | 337bbffac1 | | |
| | 527396e5cb | | |
| | bc7c35aa2d | | |
| | 0953160584 | | |
| | 80f9c147f7 | | |
| | 50e012ad6d | | |
| | 9a9644659c | | |
| | fd53bff959 | | |
| | 9439dad93a | | |
| | ef9b4b3a50 | | |
| | 7ca857dff0 | | |
| | 8331f70db7 | | |
| | a844024395 | | |
| | 47243aef14 | | |
| | a670aa5a71 | | |
| | 51b3bcdbe4 | | |
| | ae58401d53 | | |
| | 3e7eb43b51 | | |
| | 962d94567c | | |
| | 6dbef5f27b | | |
| | 88b1e4ca6e | | |
| | c1d7fa13a5 | | |
| | aa39bddb2d | | |
| | 1aea2d2f96 | | |
| | 673b450671 | | |
| | 183b6a5361 | | |
| | b6446d7316 | | |
| | d32f31f411 | | |
| | dd08f6379f | | |
| | 58ee297cde | | |
| | a3c0ba0d69 | | |
| | 75040789c6 | | |
| | f9a697518d | | |
| | c4ba6ec269 | | |
| | 440740201a | | |
| | f6908420ad | | |
| | 925b12823e | | |
| | 561e604edc | | |
| | 30cd375d03 | | |
| | 48b7fbda9c | | |
| | 85aba7b695 | | |
| | de38b46dd8 | | |
| | 0abf6ebb3d | | |
| | 360ce7e125 | | |
| | ad11db2ee1 | | |
| | 874832dc58 | | |
| | 5fe89d85c3 | | |
| | a2b3e1e0c2 | | |
| | aaec931081 | | |
| | c53fa04a2f | | |
| | d4a681c8a2 | | |
| | a819d623f7 | | |
| | de9d72da80 | | |
| | 74d8bdd60f | | |
| | a5d078d48b | | |
| | e2587cae9b | | |
| | ac581273d3 | | |
| | 7fcde97b70 | | |
| | e457f502b7 | | |
| | 46815f1a9d | | |
| | 0d251d30f4 | | |
| | b8dd9a2db0 | | |
| | 8bc90b743b | | |
| | 2f2a68be3d | | |
| | 82f1fd8054 | | |
| | a9b07d7684 | | |
26
README.md
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
|
||||
About bees
|
||||
----------
|
||||
|
||||
bees is a block-oriented userspace deduplication agent designed for large
|
||||
btrfs filesystems. It is an offline dedupe combined with an incremental
|
||||
data scan capability to minimize time data spends on disk from write
|
||||
to dedupe.
|
||||
bees is a block-oriented userspace deduplication agent designed to scale
|
||||
up to large btrfs filesystems. It is an offline dedupe combined with
|
||||
an incremental data scan capability to minimize time data spends on disk
|
||||
from write to dedupe.
|
||||
|
||||
Strengths
|
||||
---------
|
||||
|
||||
* Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
|
||||
* Daemon incrementally dedupes new data using btrfs tree search
|
||||
* Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
|
||||
* Daemon mode - incrementally dedupes new data as it appears
|
||||
* Largest extents first - recover more free space during fixed maintenance windows
|
||||
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
|
||||
* Works around btrfs filesystem structure to free more disk space
|
||||
* Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
|
||||
* Persistent hash table for rapid restart after shutdown
|
||||
* Whole-filesystem dedupe - including snapshots
|
||||
* Constant hash table size - no increased RAM usage if data set becomes larger
|
||||
* Works on live data - no scheduled downtime required
|
||||
* Automatic self-throttling based on system load
|
||||
* Automatic self-throttling - reduces system load
|
||||
* btrfs support - recovers more free space from btrfs than naive dedupers
|
||||
|
||||
Weaknesses
|
||||
----------
|
||||
|
||||
* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
|
||||
* Requires root privilege (or `CAP_SYS_ADMIN`)
|
||||
* First run may require temporary disk space for extent reorganization
|
||||
* Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
|
||||
* [First run may increase metadata space usage if many snapshots exist](docs/gotchas.md)
|
||||
* Constant hash table size - no decreased RAM usage if data set becomes smaller
|
||||
* btrfs only
|
||||
@@ -46,7 +46,7 @@ Recommended Reading
|
||||
-------------------
|
||||
|
||||
* [bees Gotchas](docs/gotchas.md)
|
||||
* [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING
|
||||
* [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
|
||||
* [bees vs. other btrfs features](docs/btrfs-other.md)
|
||||
* [What to do when something goes wrong](docs/wrong.md)
|
||||
|
||||
@@ -69,6 +69,6 @@ You can also use Github:
|
||||
Copyright & License
|
||||
-------------------
|
||||
|
||||
Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
|
||||
Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
|
||||
|
||||
GPL (version 3 or later).
|
||||
|
@@ -1,31 +1,24 @@
|
||||
Recommended Kernel Version for bees
|
||||
===================================
|
||||
Recommended Linux Kernel Version for bees
|
||||
=========================================
|
||||
|
||||
First, a warning that is not specific to bees:
|
||||
First, a warning about old Linux kernel versions:
|
||||
|
||||
> **Kernel 5.1, 5.2, and 5.3 should not be used with btrfs due to a
|
||||
severe regression that can lead to fatal metadata corruption.**
|
||||
This issue is fixed in kernel 5.4.14 and later.
|
||||
> **Linux kernel version 5.1, 5.2, and 5.3 should not be used with btrfs
|
||||
due to a severe regression that can lead to fatal metadata corruption.**
|
||||
This issue is fixed in version 5.4.14 and later.
|
||||
|
||||
**Recommended kernel versions for bees are 4.19, 5.4, 5.10, 5.11, 5.15,
|
||||
6.0, or 6.1, with recent LTS and -stable updates.** The latest released
|
||||
kernel as of this writing is 6.4.1.
|
||||
**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
|
||||
6.6, or 6.12 with recent LTS and -stable updates.** The latest released
|
||||
kernel as of this writing is 6.12.9, and the earliest supported LTS
|
||||
kernel is 5.4.
|
||||
|
||||
4.14, 4.9, and 4.4 LTS kernels with recent updates are OK with some
|
||||
issues. Older kernels will be slower (a little slower or a lot slower
|
||||
depending on which issues are triggered). Not all fixes are backported.
|
||||
|
||||
Obsolete non-LTS kernels have a variety of unfixed issues and should
|
||||
not be used with btrfs. For details see the table below.
|
||||
|
||||
bees requires btrfs kernel API version 4.2 or higher, and does not work
|
||||
at all on older kernels.
|
||||
|
||||
Some bees features rely on kernel 4.15 to work, and these features will
|
||||
not be available on older kernels. Currently, bees is still usable on
|
||||
older kernels with degraded performance or with options disabled, but
|
||||
support for older kernels may be removed.
|
||||
Some optional bees features use kernel APIs introduced in kernel 4.15
|
||||
(extent scan) and 5.6 (`openat2` support). These bees features are not
|
||||
available on older kernels. Support for older kernels may be removed
|
||||
in a future bees release.
|
||||
|
||||
bees will not run at all on kernels before 4.2 due to lack of minimal
|
||||
API support.
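
The version thresholds above can be collected into a small check. The following is a minimal Python sketch, not part of bees: the function names `parse_kernel_version` and `bees_kernel_notes` and the note strings are hypothetical. It only compares upstream version numbers (distro kernels may backport fixes), and treating every version from 5.1 up to but not including 5.4.14 as affected by the metadata corruption regression is this sketch's reading of the warning above.

```python
import re


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a release string such as '6.1.55-amd64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def bees_kernel_notes(release):
    """Return a list of notes derived from the thresholds stated in the text."""
    version = parse_kernel_version(release)
    notes = []
    if version < (4, 2, 0):
        # Below the minimal API level, nothing else matters.
        notes.append("too old: bees will not run at all before 4.2")
        return notes
    if version < (4, 15, 0):
        notes.append("extent scan unavailable (needs 4.15)")
    if version < (5, 6, 0):
        notes.append("openat2 support unavailable (needs 5.6)")
    # Assumption of this sketch: 5.1 through 5.4.13 count as affected.
    if (5, 1, 0) <= version < (5, 4, 14):
        notes.append("btrfs metadata corruption regression (fixed in 5.4.14)")
    return notes
```

On a running system, `bees_kernel_notes(platform.release())` (after `import platform`) would apply the check to the current kernel. An empty list means none of the thresholds listed here were hit; it does not cover the individual bugs in the table below.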
|
||||
|
||||
|
||||
|
||||
@@ -62,6 +55,7 @@ These bugs are particularly popular among bees users, though not all are specifi
|
||||
| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
|
||||
| - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
|
||||
| - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
|
||||
| 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount. Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
|
||||
| 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
|
||||
| - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
|
||||
| - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
|
||||
@@ -71,7 +65,7 @@ These bugs are particularly popular among bees users, though not all are specifi
|
||||
| 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later. Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
|
||||
| 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
|
||||
| 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
|
||||
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe ioctl on the same extent | - | workaround: avoid doing that
|
||||
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that
|
||||
|
||||
"Last bad kernel" refers to that version's last stable update from
|
||||
kernel.org. Distro kernels may backport additional fixes. Consult
|
||||
@@ -97,12 +91,12 @@ contains the last committed component of the fix.
|
||||
Workarounds for known kernel bugs
|
||||
---------------------------------
|
||||
|
||||
* **Hangs with concurrent `LOGICAL_INO` and dedupe**: on all
|
||||
kernel versions so far, multiple threads running `LOGICAL_INO`
|
||||
and dedupe ioctls at the same time on the same inodes or extents
|
||||
* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**: on all
|
||||
kernel versions so far, multiple threads running `LOGICAL_INO` and
|
||||
dedupe/clone ioctls at the same time on the same inodes or extents
|
||||
can lead to a kernel hang. The kernel enters an infinite loop in
|
||||
`add_all_parents`, where `count` is 0, `ref->count` is 1, and
|
||||
`btrfs_next_item` or `btrfs_next_old_item` never find a matching ref).
|
||||
`btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.
|
||||
|
||||
bees has two workarounds for this bug: 1. schedule work so that multiple
|
||||
threads do not simultaneously access the same inode or the same extent,
|
||||
@@ -123,58 +117,32 @@ Workarounds for known kernel bugs
|
||||
|
||||
It is still theoretically possible to trigger the kernel bug when
|
||||
running bees at the same time as other dedupers, or other programs
|
||||
that use `LOGICAL_INO` like `btdu`; however, it's extremely difficult
|
||||
to reproduce the bug without closely cooperating threads.
|
||||
that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
|
||||
operation such as `cp` or `mv`; however, it's extremely difficult to
|
||||
reproduce the bug without closely cooperating threads.
|
||||
|
||||
* **Slow backrefs** (aka toxic extents): Under certain conditions,
|
||||
if the number of references to a single shared extent grows too
|
||||
high, the kernel consumes more and more CPU while also holding locks
|
||||
that delay write access to the filesystem. bees avoids this bug
|
||||
by measuring the time the kernel spends performing `LOGICAL_INO`
|
||||
operations and permanently blacklisting any extent or hash involved
|
||||
where the kernel starts to get slow. In the bees log, such blocks
|
||||
are labelled as 'toxic' hash/block addresses. Toxic extents are
|
||||
rare (about 1 in 100,000 extents become toxic), but toxic extents can
|
||||
become 8 orders of magnitude more expensive to process than the fastest
|
||||
non-toxic extents. This seems to affect all dedupe agents on btrfs;
|
||||
at this time of writing only bees has a workaround for this bug.
|
||||
* **Slow backrefs** (aka toxic extents): On older kernels, under certain
|
||||
conditions, if the number of references to a single shared extent grows
|
||||
too high, the kernel consumes more and more CPU while also holding
|
||||
locks that delay write access to the filesystem. This is no longer
|
||||
a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
|
||||
but there are still some remains of earlier workarounds for this issue
|
||||
in bees that have not been fully removed.
|
||||
|
||||
This workaround is less necessary for kernels 5.4.96, 5.7 and later,
|
||||
though the bees workaround can still be triggered on newer kernels
|
||||
by changes in btrfs since kernel version 5.1.
|
||||
bees avoided this bug by measuring the time the kernel spends performing
|
||||
`LOGICAL_INO` operations and permanently blacklisting any extent or
|
||||
hash involved where the kernel starts to get slow. In the bees log,
|
||||
such blocks are labelled as 'toxic' hash/block addresses.
|
||||
|
||||
Future bees releases will remove toxic extent detection (it only detects
|
||||
false positives now) and clear all previously saved toxic extent bits.
|
||||
|
||||
* **dedupe breaks `btrfs send` in old kernels**. The bees option
|
||||
`--workaround-btrfs-send` prevents any modification of read-only subvols
|
||||
in order to avoid breaking `btrfs send`.
|
||||
in order to avoid breaking `btrfs send` on kernels before 5.2.
|
||||
|
||||
This workaround is no longer necessary to avoid kernel crashes
|
||||
and send performance failure on kernel 4.9.207, 4.14.159, 4.19.90,
|
||||
5.3.17, 5.4.4, 5.5 and later; however, some conflict between send
|
||||
and dedupe still remains, so the workaround is still useful.
|
||||
This workaround is no longer necessary to avoid kernel crashes and
|
||||
send performance failure on kernel 5.4.4 and later. bees will pause
|
||||
dedupe until the send is finished on current kernels.
|
||||
|
||||
`btrfs receive` is not and has never been affected by this issue.
|
||||
|
||||
Unfixed kernel bugs
|
||||
-------------------
|
||||
|
||||
* **The kernel does not permit `btrfs send` and dedupe to run at the
|
||||
same time**. Recent kernels no longer crash, but now refuse one
|
||||
operation with an error if the other operation was already running.
|
||||
|
||||
bees has not been updated to handle the new dedupe behavior optimally.
|
||||
Optimal behavior is to defer dedupe operations when send is detected,
|
||||
and resume after the send is finished. Current bees behavior is to
|
||||
complain loudly about each individual dedupe failure in log messages,
|
||||
and abandon duplicate data references in the snapshot that send is
|
||||
processing. A future bees version shall have better handling for
|
||||
this situation.
|
||||
|
||||
Workaround: send `SIGSTOP` to bees, or terminate the bees process,
|
||||
before running `btrfs send`.
|
||||
|
||||
This workaround is not strictly required if snapshot is deleted after
|
||||
sending. In that case, any duplicate data blocks that were not removed
|
||||
by dedupe will be removed by snapshot delete instead. The workaround
|
||||
still saves some IO.
|
||||
|
||||
`btrfs receive` is not affected by this issue.
|
||||
|
@@ -3,40 +3,34 @@ Good Btrfs Feature Interactions
|
||||
|
||||
bees has been tested in combination with the following:
|
||||
|
||||
* btrfs compression (zlib, lzo, zstd), mixtures of compressed and uncompressed extents
|
||||
* btrfs compression (zlib, lzo, zstd)
|
||||
* PREALLOC extents (unconditionally replaced with holes)
|
||||
* HOLE extents and btrfs no-holes feature
|
||||
* Other deduplicators, reflink copies (though bees may decide to redo their work)
|
||||
* btrfs snapshots and non-snapshot subvols (RW and RO)
|
||||
* Other deduplicators (`duperemove`, `jdupes`)
|
||||
* Reflink copies (modern coreutils `cp` and `mv`)
|
||||
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
|
||||
* All btrfs RAID profiles
|
||||
* IO errors during dedupe (read errors will throw exceptions, bees will catch them and skip over the affected extent)
|
||||
* Filesystems mounted with or without the `flushoncommit` option
|
||||
* All btrfs RAID profiles: single, dup, raid0, raid1, raid10, raid1c3, raid1c4, raid5, raid6
|
||||
* IO errors during dedupe (affected extents are skipped)
|
||||
* 4K filesystem data block size / clone alignment
|
||||
* 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
|
||||
* Large files (kernel 5.4 or later strongly recommended)
|
||||
* Filesystems up to 90T+ bytes, 1000M+ files
|
||||
* Filesystem data sizes up to 100T+ bytes, 1000M+ files
|
||||
* `open(O_DIRECT)` (seems to work as well--or as poorly--with bees as with any other btrfs feature)
|
||||
* btrfs-convert from ext2/3/4
|
||||
* btrfs `autodefrag` mount option
|
||||
* btrfs balance (data balances cause rescan of relocated data)
|
||||
* btrfs block-group-tree
|
||||
* btrfs `flushoncommit` and `noflushoncommit` mount options
|
||||
* btrfs mixed block groups
|
||||
* btrfs `nodatacow`/`nodatasum` inode attribute or mount option (bees skips all nodatasum files)
|
||||
* btrfs qgroups and quota support (_not_ squotas)
|
||||
* btrfs receive
|
||||
* btrfs nodatacow/nodatasum inode attribute or mount option (bees skips all nodatasum files)
|
||||
* open(O_DIRECT) (seems to work as well--or as poorly--with bees as with any other btrfs feature)
|
||||
* lvm dm-cache, writecache
|
||||
* btrfs scrub
|
||||
* btrfs send (dedupe pauses automatically, kernel 5.4 or later required)
|
||||
* btrfs snapshot, non-snapshot subvols (RW and RO), snapshot delete
|
||||
|
||||
Bad Btrfs Feature Interactions
|
||||
------------------------------
|
||||
|
||||
bees has been tested in combination with the following, and various problems are known:
|
||||
|
||||
* btrfs send: there are bugs in `btrfs send` that can be triggered by
|
||||
bees on old kernels. The [`--workaround-btrfs-send` option](options.md)
|
||||
works around this issue by preventing bees from modifying read-only
|
||||
snapshots.
|
||||
|
||||
* btrfs qgroups: very slow, sometimes hangs...and it's even worse when
|
||||
bees is running.
|
||||
|
||||
* btrfs autodefrag mount option: bees cannot distinguish autodefrag
|
||||
activity from normal filesystem activity, and may try to undo the
|
||||
autodefrag if duplicate copies of the defragmented data exist.
|
||||
**Note:** some btrfs features have minimum kernel versions which are
|
||||
higher than the minimum kernel version for bees.
|
||||
|
||||
Untested Btrfs Feature Interactions
|
||||
-----------------------------------
|
||||
@@ -45,10 +39,6 @@ bees has not been tested with the following, and undesirable interactions may oc
|
||||
|
||||
* Non-4K filesystem data block size (should work if recompiled)
|
||||
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
|
||||
* btrfs seed filesystems (no particular reason it wouldn't work, but no one has reported trying)
|
||||
* btrfs out-of-tree kernel patches (e.g. in-kernel dedupe, encryption, extent tree v2)
|
||||
* btrfs-convert from ext2/3/4 (never tested, might run out of space or ignore significant portions of the filesystem due to sanity checks)
|
||||
* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
|
||||
* btrfs seed filesystems, raid-stripe-tree, squotas (no particular reason these wouldn't work, but no one has reported trying)
|
||||
* btrfs out-of-tree kernel patches (e.g. encryption, extent tree v2)
|
||||
* Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)
|
||||
* bcache: used to be in the "bad" list, now in the "untested" list because nobody is rigorously testing, and bcache bugs come and go
|
||||
* flashcache: an out-of-tree cache-HDD-on-SSD block layer helper
|
||||
|
@@ -26,11 +26,7 @@ Here are some numbers to estimate appropriate hash table sizes:
|
||||
Notes:
|
||||
|
||||
* If the hash table is too large, no extra dedupe efficiency is
|
||||
obtained, and the extra space wastes RAM. If the hash table contains
|
||||
more block records than there are blocks in the filesystem, the extra
|
||||
space can slow bees down. A table that is too large prevents obsolete
|
||||
data from being evicted, so bees wastes time looking for matching data
|
||||
that is no longer present on the filesystem.
|
||||
obtained, and the extra space wastes RAM.
|
||||
|
||||
* If the hash table is too small, bees extrapolates from matching
|
||||
blocks to find matching adjacent blocks in the filesystem that have been
|
||||
@@ -59,19 +55,19 @@ patterns on dedupe effectiveness without performing deep inspection of
|
||||
both the filesystem data and its structure--a task that is as expensive
|
||||
as performing the deduplication.
|
||||
|
||||
* **Compression** on the filesystem reduces the average extent length
|
||||
compared to uncompressed filesystems. The maximum compressed extent
|
||||
length on btrfs is 128KB, while the maximum uncompressed extent length
|
||||
is 128MB. Longer extents decrease the optimum hash table size while
|
||||
shorter extents increase the optimum hash table size because the
|
||||
probability of a hash table entry being present (i.e. unevicted) in
|
||||
each extent is proportional to the extent length.
|
||||
* **Compression** in files reduces the average extent length compared
|
||||
to uncompressed files. The maximum compressed extent length on
|
||||
btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
|
||||
Longer extents decrease the optimum hash table size while shorter extents
|
||||
increase the optimum hash table size, because the probability of a hash
|
||||
table entry being present (i.e. unevicted) in each extent is proportional
|
||||
to the extent length.
|
||||
|
||||
As a rule of thumb, the optimal hash table size for a compressed
|
||||
filesystem is 2-4x larger than the optimal hash table size for the same
|
||||
data on an uncompressed filesystem. Dedupe efficiency falls dramatically
|
||||
with hash tables smaller than 128MB/TB as the average dedupe extent size
|
||||
is larger than the largest possible compressed extent size (128KB).
|
||||
data on an uncompressed filesystem. Dedupe efficiency falls rapidly with
|
||||
hash tables smaller than 128MB/TB as the average dedupe extent size is
|
||||
larger than the largest possible compressed extent size (128KB).
|
||||
|
||||
* **Short writes or fragmentation** also shorten the average extent
|
||||
length and increase optimum hash table size. If a database writes to
|
||||
@@ -115,7 +111,6 @@ Extent scan mode:
|
||||
* Works with 4.15 and later kernels.
|
||||
* Can estimate progress and provide an ETA.
|
||||
* Can optimize scanning order to dedupe large extents first.
|
||||
* Cannot avoid modifying read-only subvols.
|
||||
* Can keep up with frequent creation and deletion of snapshots.
|
||||
|
||||
Subvol scan modes:
|
||||
@@ -123,8 +118,7 @@ Subvol scan modes:
|
||||
* Work with 4.14 and earlier kernels.
|
||||
* Cannot estimate or report progress.
|
||||
* Cannot optimize scanning order by extent size.
|
||||
* Can avoid modifying read-only subvols (for `btrfs send` workaround).
|
||||
* Have problems keeping up with snapshots created during a scan.
|
||||
* Have problems keeping up with multiple snapshots created during a scan.
|
||||
|
||||
The default scan mode is 4, "extent".
|
||||
|
||||
@@ -212,7 +206,7 @@ Extent scan mode
|
||||
Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
|
||||
Extent scan mode reads each extent once, regardless of the number of
|
||||
reflinks or snapshots. It adapts to the creation of new snapshots
|
||||
immediately, without having to revisit old data.
|
||||
and reflinks immediately, without having to revisit old data.
|
||||
|
||||
In the extent scan mode, extents are separated into multiple size tiers
|
||||
to prioritize large extents over small ones. Deduping large extents
|
||||
@@ -268,17 +262,54 @@ send` in extent scan mode, and restart bees after the `send` is complete.
|
||||
Threads and load management
|
||||
---------------------------
|
||||
|
||||
By default, bees creates one worker thread for each CPU detected.
|
||||
These threads then perform scanning and dedupe operations. The number of
|
||||
worker threads can be set with the [`--thread-count` and `--thread-factor`
|
||||
options](options.md).
|
||||
By default, bees creates one worker thread for each CPU detected. These
|
||||
threads then perform scanning and dedupe operations. bees attempts to
|
||||
maximize the amount of productive work each thread does, until either the
|
||||
threads are all continuously busy, or there is no remaining work to do.
|
||||
|
||||
If desired, bees can automatically increase or decrease the number
|
||||
of worker threads in response to system load. This reduces impact on
|
||||
the rest of the system by pausing bees when other CPU and IO intensive
|
||||
loads are active on the system, and resumes bees when the other loads
|
||||
are inactive. This is configured with the [`--loadavg-target` and
|
||||
`--thread-min` options](options.md).
|
||||
In many cases it is not desirable to continually run bees at maximum
|
||||
performance. Maximum performance is not necessary if bees can dedupe
|
||||
new data faster than it appears on the filesystem. If it only takes
|
||||
bees 10 minutes per day to dedupe all new data on a filesystem, then
|
||||
bees doesn't need to run for more than 10 minutes per day.
|
||||
|
||||
bees supports a number of options for reducing system load:
|
||||
|
||||
* Run bees for a few hours per day, at an off-peak time (i.e. during
|
||||
a maintenance window), instead of running bees continuously. Any data
|
||||
added to the filesystem while bees is not running will be scanned when
|
||||
bees restarts. At the end of the maintenance window, terminate the
|
||||
bees process with SIGTERM to write the hash table and scan position
|
||||
for the next maintenance window.
|
||||
|
||||
* Temporarily pause bees operation by sending the bees process SIGUSR1,
|
||||
and resume operation with SIGUSR2. This is preferable to freezing
|
||||
and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
|
||||
signals, because it allows bees to close open file handles that would
|
||||
otherwise prevent those files from being deleted while bees is frozen.
|
||||
|
||||
* Reduce the number of worker threads with the [`--thread-count` or
|
||||
`--thread-factor` options](options.md). This simply leaves CPU cores
|
||||
idle so that other applications on the host can use them, or to save
|
||||
power.
|
||||
|
||||
* Allow bees to automatically track system load and increase or decrease
|
||||
the number of threads to reach a target system load. This reduces
|
||||
impact on the rest of the system by pausing bees when other CPU and IO
|
||||
intensive loads are active on the system, and resumes bees when the other
|
||||
loads are inactive. This is configured with the [`--loadavg-target`
|
||||
and `--thread-min` options](options.md).
|
||||
|
||||
* Allow bees to self-throttle operations that enqueue delayed work
|
||||
within btrfs. These operations are not well controlled by Linux
|
||||
features such as process priority or IO priority or IO rate-limiting,
|
||||
because the enqueued work is submitted to btrfs several seconds before
|
||||
btrfs performs the work. By the time btrfs performs the work, it's too
|
||||
late for external throttling to be effective. The [`--throttle-factor`
|
||||
option](options.md) tracks how long it takes btrfs to complete queued
|
||||
operations, and reduces bees's queued work submission rate to match
|
||||
btrfs's queued work completion rate (or a fraction thereof, to reduce
|
||||
system load).
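
The pause and resume signals described in the list above can be driven from a script. The following is a minimal Python sketch, not part of bees: the helper name `run_with_bees_paused` is hypothetical, and how the bees PID is obtained is left to the caller. Only the signal meanings (SIGUSR1 pauses, SIGUSR2 resumes) come from the text above. The `send` parameter exists only so the helper can be exercised without a real bees process.

```python
import os
import signal


def run_with_bees_paused(bees_pid, action, send=os.kill):
    """Pause bees with SIGUSR1, run action(), then always resume with SIGUSR2."""
    send(bees_pid, signal.SIGUSR1)
    try:
        return action()
    finally:
        # Resume even if the action raised, so bees is not left paused.
        send(bees_pid, signal.SIGUSR2)
```

A caller would pass the PID of the running bees process and a callable that performs the other CPU- or IO-intensive job. Using `try`/`finally` is a design choice of this sketch: it keeps a failed job from leaving bees paused indefinitely.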
|
||||
|
||||
Log verbosity
|
||||
-------------
|
||||
|
@@ -120,13 +120,14 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
|
||||
|
||||
* `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
|
||||
* `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
|
||||
* `crawl_create`: A new subvol or extent crawler was created.
|
||||
* `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
|
||||
* `crawl_done`: One pass over a subvol was completed.
|
||||
* `crawl_discard`: An extent that didn't match the crawler's size tier was discarded.
|
||||
* `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
|
||||
* `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
|
||||
* `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
|
||||
* `crawl_extent`: The extent crawler queued all references to an extent for processing.
|
||||
* `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
|
||||
* `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
|
||||
* `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
|
||||
* `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
|
||||
* `crawl_hole`: An extent item in the search results refers to a hole.
|
||||
@@ -138,6 +139,8 @@ The `crawl` event group consists of operations related to scanning btrfs trees t
|
||||
* `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
|
||||
* `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
|
||||
* `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
|
||||
* `crawl_skip`: Small extent items were skipped because no extent of sufficient size was found within the minimum search distance.
|
||||
* `crawl_skip_ms`: Time spent skipping small extent items.
|
||||
* `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
|
||||
* `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
|
||||
* `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
|
||||
@@ -281,11 +284,14 @@ The `progress` event group consists of events related to progress estimation.
|
||||
readahead
|
||||
---------
|
||||
|
||||
The `readahead` event group consists of events related to calls to `posix_fadvise`.
|
||||
The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).
|
||||
|
||||
* `readahead_bytes`: Number of bytes prefetched.
|
||||
* `readahead_count`: Number of read calls.
|
||||
* `readahead_clear`: Number of times the duplicate read cache was cleared.
|
||||
* `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
|
||||
* `readahead_fail`: Number of read errors during prefetch.
|
||||
* `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
|
||||
* `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
|
||||
* `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.
|
||||
|
||||
replacedst
|
||||
|
@@ -6,30 +6,30 @@ Best-Effort Extent-Same, a btrfs deduplication agent.
|
||||
About bees
|
||||
----------
|
||||
|
||||
bees is a block-oriented userspace deduplication agent designed for large
|
||||
btrfs filesystems. It is an offline dedupe combined with an incremental
|
||||
data scan capability to minimize time data spends on disk from write
|
||||
to dedupe.
|
||||
bees is a block-oriented userspace deduplication agent designed to scale
|
||||
up to large btrfs filesystems. It is an offline dedupe combined with
|
||||
an incremental data scan capability to minimize time data spends on disk
|
||||
from write to dedupe.
|
||||
|
||||
Strengths
|
||||
---------
|
||||
|
||||
* Space-efficient hash table and matching algorithms - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
|
||||
* Daemon incrementally dedupes new data using btrfs tree search
|
||||
* Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
|
||||
* Daemon mode - incrementally dedupes new data as it appears
|
||||
* Largest extents first - recover more free space during fixed maintenance windows
|
||||
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
|
||||
* Works around btrfs filesystem structure to free more disk space
|
||||
* Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
|
||||
* Persistent hash table for rapid restart after shutdown
|
||||
* Whole-filesystem dedupe - including snapshots
|
||||
* Constant hash table size - no increased RAM usage if data set becomes larger
|
||||
* Works on live data - no scheduled downtime required
|
||||
* Automatic self-throttling based on system load
|
||||
* Automatic self-throttling - reduces system load
|
||||
* btrfs support - recovers more free space from btrfs than naive dedupers
|
||||
|
||||
Weaknesses
|
||||
----------
|
||||
|
||||
* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
|
||||
* Requires root privilege (or `CAP_SYS_ADMIN`)
|
||||
* First run may require temporary disk space for extent reorganization
|
||||
* Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
|
||||
* [First run may increase metadata space usage if many snapshots exist](gotchas.md)
|
||||
* Constant hash table size - no decreased RAM usage if data set becomes smaller
|
||||
* btrfs only
|
||||
@@ -46,7 +46,7 @@ Recommended Reading
|
||||
-------------------
|
||||
|
||||
* [bees Gotchas](gotchas.md)
|
||||
* [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING
|
||||
* [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
|
||||
* [bees vs. other btrfs features](btrfs-other.md)
|
||||
* [What to do when something goes wrong](wrong.md)
|
||||
|
||||
@@ -69,6 +69,6 @@ You can also use Github:
|
||||
Copyright & License
|
||||
-------------------
|
||||
|
||||
Copyright 2015-2023 Zygo Blaxell <bees@furryterror.org>.
|
||||
Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
|
||||
|
||||
GPL (version 3 or later).
|
||||
|
@@ -84,19 +84,22 @@
|
||||
|
||||
* `--workaround-btrfs-send` or `-a`
|
||||
|
||||
_This option is obsolete and should not be used any more._
|
||||
|
||||
Pretend that read-only snapshots are empty and silently discard any
|
||||
request to dedupe files referenced through them. This is a workaround for
|
||||
[problems with the kernel implementation of `btrfs send` and `btrfs send
|
||||
request to dedupe files referenced through them. This is a workaround
|
||||
for [problems with old kernels running `btrfs send` and `btrfs send
|
||||
-p`](btrfs-kernel.md) which make these btrfs features unusable with bees.
|
||||
|
||||
This option should be used to avoid breaking `btrfs send` on the same
|
||||
filesystem.
|
||||
This option was used to avoid breaking `btrfs send` on old kernels.
|
||||
The affected kernels are now too old to be recommended for use with bees.
|
||||
|
||||
bees now waits for `btrfs send` to finish. There is no need for an
|
||||
option to enable this.
|
||||
|
||||
**Note:** There is a _significant_ space tradeoff when using this option:
|
||||
it is likely no space will be recovered--and possibly significant extra
|
||||
space used--until the read-only snapshots are deleted. On the other
|
||||
hand, if snapshots are rotated frequently then bees will spend less time
|
||||
scanning them.
|
||||
space used--until the read-only snapshots are deleted.
|
||||
|
||||
## Logging options
|
||||
|
||||
|
@@ -4,16 +4,13 @@ What to do when something goes wrong with bees
|
||||
Hangs and excessive slowness
|
||||
----------------------------
|
||||
|
||||
### Are you using qgroups or autodefrag?
|
||||
|
||||
Read about [bad btrfs feature interactions](btrfs-other.md).
|
||||
|
||||
### Use load-throttling options
|
||||
|
||||
If bees is just more aggressive than you would like, consider using
|
||||
[load throttling options](options.md). These are usually more effective
|
||||
than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
|
||||
certainly use those too).
|
||||
certainly use those too) because they limit work that bees queues up
|
||||
for later execution inside btrfs.
|
||||
|
||||
### Check `$BEESSTATUS`
|
||||
|
||||
@@ -52,10 +49,6 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
|
||||
|
||||
Thread names of note:
|
||||
|
||||
* `crawl_12345`: scan/dedupe worker threads (the number is the subvol
|
||||
ID which the thread is currently working on). These threads appear
|
||||
and disappear from the status dynamically according to the requirements
|
||||
of the work queue and loadavg throttling.
|
||||
* `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
|
||||
* `crawl_master`: task that finds new extents in the filesystem and populates the work queue
|
||||
* `crawl_transid`: btrfs transid (generation number) tracker and polling thread
|
||||
@@ -64,6 +57,13 @@ dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/li
|
||||
* `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
|
||||
* `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly
|
||||
|
||||
Most other threads have names that are derived from the current dedupe
|
||||
task that they are executing:
|
||||
|
||||
* `ref_205ad76b1000_24K_50`: extent scan performing dedupe of btrfs extent bytenr `205ad76b1000`, which is 24 KiB long and has 50 references
|
||||
* `extent_250_32M_16E`: extent scan searching for extents between 32 MiB + 1 and 16 EiB bytes long, tracking scan position in virtual subvol `250`.
|
||||
* `crawl_378_18916`: subvol scan searching for extent refs in subvol `378`, inode `18916`.
|
||||
|
||||
### Dump kernel stacks of hung processes
|
||||
|
||||
Check the kernel stacks of all blocked kernel processes:
|
||||
@@ -91,7 +91,7 @@ bees Crashes
|
||||
(gdb) thread apply all bt full
|
||||
|
||||
The last line generates megabytes of output and will often crash gdb.
|
||||
This is OK, submit whatever output gdb can produce.
|
||||
Submit whatever output gdb can produce.
|
||||
|
||||
**Note that this output may include filenames or data from your
|
||||
filesystem.**
|
||||
@@ -160,8 +160,7 @@ Kernel crashes, corruption, and filesystem damage
|
||||
-------------------------------------------------
|
||||
|
||||
bees doesn't do anything that _should_ cause corruption or data loss;
|
||||
however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
|
||||
with some Linux block device layers](btrfs-other.md), so corruption is
|
||||
however, [btrfs has kernel bugs](btrfs-kernel.md), so corruption is
|
||||
not impossible.
|
||||
|
||||
Issues with the btrfs filesystem kernel code or other block device layers
|
||||
|
@@ -173,34 +173,42 @@ namespace crucible {
|
||||
void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
|
||||
};
|
||||
|
||||
/// Fetch extent items from extent tree
|
||||
/// Fetch extent items from extent tree.
|
||||
/// Does not filter out metadata! See BtrfsDataExtentTreeFetcher for that.
|
||||
class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
|
||||
public:
|
||||
BtrfsExtentItemFetcher(const Fd &fd);
|
||||
};
|
||||
|
||||
/// Fetch extent refs from an inode
|
||||
/// Fetch extent refs from an inode. Caller must set the tree and objectid.
|
||||
class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
|
||||
public:
|
||||
BtrfsExtentDataFetcher(const Fd &fd);
|
||||
};
|
||||
|
||||
/// Fetch inodes from a subvol
|
||||
class BtrfsFsTreeFetcher : public BtrfsTreeObjectFetcher {
|
||||
public:
|
||||
BtrfsFsTreeFetcher(const Fd &fd, uint64_t subvol);
|
||||
};
|
||||
|
||||
/// Fetch raw inode items
|
||||
class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
|
||||
public:
|
||||
BtrfsInodeFetcher(const Fd &fd);
|
||||
BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
|
||||
};
|
||||
|
||||
/// Fetch a root (subvol) item
|
||||
class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
|
||||
public:
|
||||
BtrfsRootFetcher(const Fd &fd);
|
||||
BtrfsTreeItem root(uint64_t subvol);
|
||||
BtrfsTreeItem root_backref(uint64_t subvol);
|
||||
};
|
||||
|
||||
/// Fetch data extent items from extent tree, skipping metadata-only block groups
|
||||
class BtrfsDataExtentTreeFetcher : public BtrfsExtentItemFetcher {
|
||||
BtrfsTreeItem m_current_bg;
|
||||
BtrfsTreeOffsetFetcher m_chunk_tree;
|
||||
protected:
|
||||
virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr) override;
|
||||
public:
|
||||
BtrfsDataExtentTreeFetcher(const Fd &fd);
|
||||
};
|
||||
|
||||
}
|
||||
|
@@ -78,9 +78,6 @@ enum btrfs_compression_type {
|
||||
#define BTRFS_SHARED_BLOCK_REF_KEY 182
|
||||
#define BTRFS_SHARED_DATA_REF_KEY 184
|
||||
#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
|
||||
#define BTRFS_FREE_SPACE_INFO_KEY 198
|
||||
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
|
||||
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
|
||||
#define BTRFS_DEV_EXTENT_KEY 204
|
||||
#define BTRFS_DEV_ITEM_KEY 216
|
||||
#define BTRFS_CHUNK_ITEM_KEY 228
|
||||
@@ -97,6 +94,18 @@ enum btrfs_compression_type {
|
||||
|
||||
#endif
|
||||
|
||||
#ifndef BTRFS_FREE_SPACE_INFO_KEY
|
||||
#define BTRFS_FREE_SPACE_INFO_KEY 198
|
||||
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
|
||||
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
|
||||
#define BTRFS_FREE_SPACE_OBJECTID -11ULL
|
||||
#endif
|
||||
|
||||
#ifndef BTRFS_BLOCK_GROUP_RAID1C4
|
||||
#define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
|
||||
#define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
|
||||
#endif
|
||||
|
||||
#ifndef BTRFS_DEFRAG_RANGE_START_IO
|
||||
|
||||
// For some reason uapi has BTRFS_DEFRAG_RANGE_COMPRESS and
|
||||
|
@@ -201,11 +201,13 @@ namespace crucible {
|
||||
static thread_local size_t s_calls;
|
||||
static thread_local size_t s_loops;
|
||||
static thread_local size_t s_loops_empty;
|
||||
static thread_local shared_ptr<ostream> s_debug_ostream;
|
||||
};
|
||||
|
||||
ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
|
||||
ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);
|
||||
|
||||
string btrfs_chunk_type_ntoa(uint64_t type);
|
||||
string btrfs_search_type_ntoa(unsigned type);
|
||||
string btrfs_search_objectid_ntoa(uint64_t objectid);
|
||||
string btrfs_compress_type_ntoa(uint8_t type);
|
||||
@@ -246,9 +248,11 @@ namespace crucible {
|
||||
struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args_v3 {
|
||||
BtrfsIoctlFsInfoArgs();
|
||||
void do_ioctl(int fd);
|
||||
bool do_ioctl_nothrow(int fd);
|
||||
uint16_t csum_type() const;
|
||||
uint16_t csum_size() const;
|
||||
uint64_t generation() const;
|
||||
vector<uint8_t> fsid() const;
|
||||
};
|
||||
|
||||
ostream & operator<<(ostream &os, const BtrfsIoctlFsInfoArgs &a);
|
||||
|
@@ -13,7 +13,7 @@ namespace crucible {
|
||||
hexdump(ostream &os, const V &v)
|
||||
{
|
||||
const auto v_size = v.size();
|
||||
const uint8_t* const v_data = reinterpret_cast<uint8_t*>(v.data());
|
||||
const uint8_t* const v_data = reinterpret_cast<const uint8_t*>(v.data());
|
||||
os << "V { size = " << v_size << ", data:\n";
|
||||
for (size_t i = 0; i < v_size; i += 8) {
|
||||
string hex, ascii;
|
||||
|
@@ -117,7 +117,7 @@ namespace crucible {
|
||||
while (full() || locked(name)) {
|
||||
m_condvar.wait(lock);
|
||||
}
|
||||
auto rv = m_set.insert(make_pair(name, crucible::gettid()));
|
||||
auto rv = m_set.insert(make_pair(name, gettid()));
|
||||
THROW_CHECK0(runtime_error, rv.second);
|
||||
}
|
||||
|
||||
@@ -129,7 +129,7 @@ namespace crucible {
|
||||
if (full() || locked(name)) {
|
||||
return false;
|
||||
}
|
||||
auto rv = m_set.insert(make_pair(name, crucible::gettid()));
|
||||
auto rv = m_set.insert(make_pair(name, gettid()));
|
||||
THROW_CHECK1(runtime_error, name, rv.second);
|
||||
return true;
|
||||
}
|
||||
|
52
include/crucible/openat2.h
Normal file
@@ -0,0 +1,52 @@
|
||||
#ifndef CRUCIBLE_OPENAT2_H
|
||||
#define CRUCIBLE_OPENAT2_H
|
||||
|
||||
#include <cstdlib>
|
||||
|
||||
// Compatibility for building on old libc for new kernel
|
||||
#include <linux/version.h>
|
||||
|
||||
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
|
||||
|
||||
#include <linux/openat2.h>
|
||||
|
||||
#else
|
||||
|
||||
#include <linux/types.h>
|
||||
|
||||
#ifndef RESOLVE_NO_XDEV
|
||||
#define RESOLVE_NO_XDEV 1
|
||||
|
||||
// RESOLVE_NO_XDEV was there from the beginning of openat2,
|
||||
// so if that's missing, so is open_how
|
||||
|
||||
struct open_how {
|
||||
__u64 flags;
|
||||
__u64 mode;
|
||||
__u64 resolve;
|
||||
};
|
||||
#endif
|
||||
|
||||
#ifndef RESOLVE_NO_MAGICLINKS
|
||||
#define RESOLVE_NO_MAGICLINKS 2
|
||||
#endif
|
||||
#ifndef RESOLVE_NO_SYMLINKS
|
||||
#define RESOLVE_NO_SYMLINKS 4
|
||||
#endif
|
||||
#ifndef RESOLVE_BENEATH
|
||||
#define RESOLVE_BENEATH 8
|
||||
#endif
|
||||
#ifndef RESOLVE_IN_ROOT
|
||||
#define RESOLVE_IN_ROOT 16
|
||||
#endif
|
||||
|
||||
#endif // Linux version >= v5.6
|
||||
|
||||
extern "C" {
|
||||
|
||||
/// Weak symbol to support libc with no syscall wrapper
|
||||
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size) throw();
|
||||
|
||||
};
|
||||
|
||||
#endif // CRUCIBLE_OPENAT2_H
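The header above only declares `openat2`; the definition (presumably in the new `openat2.o` object) is not part of this excerpt. As a rough illustration of what such a fallback might look like when libc has no wrapper, here is a minimal sketch that forwards to the raw syscall. The names `openat2_sketch` and `open_how_sketch` are hypothetical, chosen to avoid clashing with real declarations, and the fallback syscall number 437 is an assumption that holds for the common Linux architectures.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Assumption: openat2 is syscall 437 where the headers do not define it.
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

// Hypothetical stand-in with the same three 64-bit fields as struct open_how.
struct open_how_sketch {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

// Forward directly to the kernel. Returns a file descriptor, or -1 with errno set
// (ENOSYS when the running kernel predates openat2).
int openat2_sketch(int dirfd, const char *pathname, open_how_sketch *how, size_t size) noexcept
{
	return static_cast<int>(syscall(SYS_openat2, dirfd, pathname, how, size));
}
```

A caller would zero-initialize the struct, set `flags` (for example `O_RDONLY | O_DIRECTORY`) and optionally a `resolve` mask, then pass `sizeof` of the struct as the last argument.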
|
@@ -10,6 +10,10 @@
|
||||
#include <sys/wait.h>
|
||||
#include <unistd.h>
|
||||
|
||||
extern "C" {
|
||||
pid_t gettid() throw();
|
||||
};
|
||||
|
||||
namespace crucible {
|
||||
using namespace std;
|
||||
|
||||
@@ -73,7 +77,6 @@ namespace crucible {
|
||||
|
||||
typedef ResourceHandle<Process::id, Process> Pid;
|
||||
|
||||
pid_t gettid();
|
||||
double getloadavg1();
|
||||
double getloadavg5();
|
||||
double getloadavg15();
|
||||
|
@@ -6,23 +6,23 @@
|
||||
#include <algorithm>
|
||||
#include <limits>
|
||||
|
||||
#include <cstdint>
|
||||
|
||||
#if 1
|
||||
// Debug stream
|
||||
#include <memory>
|
||||
#include <iostream>
|
||||
#include <sstream>
|
||||
#define DINIT(__x) __x
|
||||
#define DLOG(__x) do { logs << __x << std::endl; } while (false)
|
||||
#define DOUT(__err) do { __err << logs.str(); } while (false)
|
||||
#else
|
||||
#define DINIT(__x) do {} while (false)
|
||||
#define DLOG(__x) do {} while (false)
|
||||
#define DOUT(__x) do {} while (false)
|
||||
#endif
|
||||
|
||||
#include <cstdint>
|
||||
|
||||
namespace crucible {
|
||||
using namespace std;
|
||||
|
||||
extern thread_local shared_ptr<ostream> tl_seeker_debug_str;
|
||||
#define SEEKER_DEBUG_LOG(__x) do { \
|
||||
if (tl_seeker_debug_str) { \
|
||||
(*tl_seeker_debug_str) << __x << "\n"; \
|
||||
} \
|
||||
} while (false)
|
||||
|
||||
// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
|
||||
// - fetches objects in Pos order, starting from lower (must be >= lower)
|
||||
// - must return upper if present, may or may not return objects after that
|
||||
@@ -49,113 +49,108 @@ namespace crucible {
|
||||
Pos
|
||||
seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
|
||||
{
|
||||
DINIT(ostringstream logs);
|
||||
try {
|
||||
static const Pos end_pos = numeric_limits<Pos>::max();
|
||||
// TBH this probably won't work if begin_pos != 0, i.e. any signed type
|
||||
static const Pos begin_pos = numeric_limits<Pos>::min();
|
||||
// Run a binary search looking for the highest key below target_pos.
|
||||
// Initial upper bound of the search is target_pos.
|
||||
// Find initial lower bound by doubling the size of the range until a key below target_pos
|
||||
// is found, or the lower bound reaches the beginning of the search space.
|
||||
// If the lower bound search reaches the beginning of the search space without finding a key,
|
||||
// return the beginning of the search space; otherwise, perform a binary search between
|
||||
// the bounds now established.
|
||||
Pos lower_bound = 0;
|
||||
Pos upper_bound = target_pos;
|
||||
bool found_low = false;
|
||||
Pos probe_pos = target_pos;
|
||||
// We need one loop for each bit of the search space to find the lower bound,
|
||||
// one loop for each bit of the search space to find the upper bound,
|
||||
// and one extra loop to confirm the boundary is correct.
|
||||
for (size_t loop_count = min(numeric_limits<Pos>::digits * size_t(2) + 1, max_loops); loop_count; --loop_count) {
|
||||
DLOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
|
||||
auto result = fetch(probe_pos, target_pos);
|
||||
const Pos low_pos = result.empty() ? end_pos : *result.begin();
|
||||
const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
|
||||
DLOG(" = " << low_pos << ".." << high_pos);
|
||||
// check for correct behavior of the fetch function
|
||||
THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
|
||||
THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
|
||||
THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
|
||||
if (!found_low) {
|
||||
// if target_pos == end_pos then we will find it in every empty result set,
|
||||
// so in that case we force the lower bound to be lower than end_pos
|
||||
if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
|
||||
// found a lower bound, set the low bound there and switch to binary search
|
||||
found_low = true;
|
||||
lower_bound = low_pos;
|
||||
DLOG("found_low = true, lower_bound = " << lower_bound);
|
||||
} else {
|
||||
// still looking for lower bound
|
||||
// if probe_pos was begin_pos then we can stop with no result
|
||||
if (probe_pos == begin_pos) {
|
||||
DLOG("return: probe_pos == begin_pos " << begin_pos);
|
||||
return begin_pos;
|
||||
}
|
||||
// double the range size, or use the distance between objects found so far
|
||||
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
|
||||
// already checked low_pos <= high_pos above
|
||||
const Pos want_delta = max(upper_bound - probe_pos, min_step);
|
||||
// avoid underflowing the beginning of the search space
|
||||
const Pos have_delta = min(want_delta, probe_pos - begin_pos);
|
||||
THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
|
||||
// move probe and try again
|
||||
probe_pos = probe_pos - have_delta;
|
||||
DLOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
|
||||
continue;
|
||||
static const Pos end_pos = numeric_limits<Pos>::max();
|
||||
// TBH this probably won't work if begin_pos != 0, i.e. any signed type
|
||||
static const Pos begin_pos = numeric_limits<Pos>::min();
|
||||
// Run a binary search looking for the highest key below target_pos.
|
||||
// Initial upper bound of the search is target_pos.
|
||||
// Find initial lower bound by doubling the size of the range until a key below target_pos
|
||||
// is found, or the lower bound reaches the beginning of the search space.
|
||||
// If the lower bound search reaches the beginning of the search space without finding a key,
|
||||
// return the beginning of the search space; otherwise, perform a binary search between
|
||||
// the bounds now established.
|
||||
Pos lower_bound = 0;
|
||||
Pos upper_bound = target_pos;
|
||||
bool found_low = false;
|
||||
Pos probe_pos = target_pos;
|
||||
// We need one loop for each bit of the search space to find the lower bound,
|
||||
// one loop for each bit of the search space to find the upper bound,
|
||||
// and one extra loop to confirm the boundary is correct.
|
||||
for (size_t loop_count = min((1 + numeric_limits<Pos>::digits) * size_t(2), max_loops); loop_count; --loop_count) {
|
||||
SEEKER_DEBUG_LOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
|
||||
auto result = fetch(probe_pos, target_pos);
|
||||
const Pos low_pos = result.empty() ? end_pos : *result.begin();
|
||||
const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
|
||||
SEEKER_DEBUG_LOG(" = " << low_pos << ".." << high_pos);
|
||||
// check for correct behavior of the fetch function
|
||||
THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
|
||||
THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
|
||||
THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
|
||||
if (!found_low) {
|
||||
// if target_pos == end_pos then we will find it in every empty result set,
|
||||
// so in that case we force the lower bound to be lower than end_pos
|
||||
if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
|
||||
// found a lower bound, set the low bound there and switch to binary search
|
||||
found_low = true;
|
||||
lower_bound = low_pos;
|
||||
SEEKER_DEBUG_LOG("found_low = true, lower_bound = " << lower_bound);
|
||||
} else {
|
||||
// still looking for lower bound
|
||||
// if probe_pos was begin_pos then we can stop with no result
|
||||
if (probe_pos == begin_pos) {
|
||||
SEEKER_DEBUG_LOG("return: probe_pos == begin_pos " << begin_pos);
|
||||
return begin_pos;
|
||||
}
|
||||
// double the range size, or use the distance between objects found so far
|
||||
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
|
||||
// already checked low_pos <= high_pos above
|
||||
const Pos want_delta = max(upper_bound - probe_pos, min_step);
|
||||
// avoid underflowing the beginning of the search space
|
||||
const Pos have_delta = min(want_delta, probe_pos - begin_pos);
|
||||
THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
|
||||
// move probe and try again
|
||||
probe_pos = probe_pos - have_delta;
|
||||
SEEKER_DEBUG_LOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
|
||||
continue;
|
||||
}
|
||||
if (low_pos <= target_pos && target_pos <= high_pos) {
|
||||
// have keys on either side of target_pos in result
|
||||
// search from the high end until we find the highest key below target
|
||||
for (auto i = result.rbegin(); i != result.rend(); ++i) {
|
||||
// more correctness checking for fetch
|
||||
THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
|
||||
if (*i <= target_pos) {
|
||||
DLOG("return: *i " << *i << " <= target_pos " << target_pos);
|
||||
return *i;
|
||||
}
|
||||
}
|
||||
// if the list is empty then low_pos = high_pos = end_pos
|
||||
// if target_pos = end_pos also, then we will execute the loop
|
||||
// above but not find any matching entries.
|
||||
THROW_CHECK0(runtime_error, result.empty());
|
||||
}
|
||||
if (target_pos <= low_pos) {
|
||||
// results are all too high, so probe_pos..low_pos is too high
|
||||
// lower the high bound to the probe pos
|
||||
upper_bound = probe_pos;
|
||||
DLOG("upper_bound = probe_pos " << probe_pos);
|
||||
}
|
||||
if (high_pos < target_pos) {
|
||||
// results are all too low, so probe_pos..high_pos is too low
|
||||
// raise the low bound to the high_pos
|
||||
DLOG("lower_bound = high_pos " << high_pos);
|
||||
lower_bound = high_pos;
|
||||
}
|
||||
// compute a new probe pos at the middle of the range and try again
|
||||
// we can't have a zero-size range here because we would not have set found_low yet
|
||||
THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
|
||||
const Pos delta = (upper_bound - lower_bound) / 2;
|
||||
probe_pos = lower_bound + delta;
|
||||
if (delta < 1) {
|
||||
// nothing can exist in the range (lower_bound, upper_bound)
|
||||
// and an object is known to exist at lower_bound
|
||||
DLOG("return: probe_pos == lower_bound " << lower_bound);
|
||||
return lower_bound;
|
||||
}
|
||||
THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
|
||||
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
|
||||
DLOG("loop: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
|
||||
}
|
||||
THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
|
||||
"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
|
||||
"found_low " << found_low);
|
||||
} catch (...) {
|
||||
DOUT(cerr);
|
||||
throw;
|
||||
if (low_pos <= target_pos && target_pos <= high_pos) {
|
||||
// have keys on either side of target_pos in result
|
||||
// search from the high end until we find the highest key below target
|
||||
for (auto i = result.rbegin(); i != result.rend(); ++i) {
|
||||
// more correctness checking for fetch
|
||||
THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
|
||||
if (*i <= target_pos) {
|
||||
SEEKER_DEBUG_LOG("return: *i " << *i << " <= target_pos " << target_pos);
|
||||
return *i;
|
||||
}
|
||||
}
|
||||
// if the list is empty then low_pos = high_pos = end_pos
|
||||
// if target_pos = end_pos also, then we will execute the loop
|
||||
// above but not find any matching entries.
|
||||
THROW_CHECK0(runtime_error, result.empty());
|
||||
}
|
||||
if (target_pos <= low_pos) {
|
||||
// results are all too high, so probe_pos..low_pos is too high
|
||||
// lower the high bound to the probe pos, low_pos cannot be lower
|
||||
SEEKER_DEBUG_LOG("upper_bound = probe_pos " << probe_pos);
|
||||
upper_bound = probe_pos;
|
||||
}
|
||||
if (high_pos < target_pos) {
|
||||
// results are all too low, so probe_pos..high_pos is too low
|
||||
// raise the low bound to high_pos but not above upper_bound
|
||||
const auto next_pos = min(high_pos, upper_bound);
|
||||
SEEKER_DEBUG_LOG("lower_bound = next_pos " << next_pos);
|
||||
lower_bound = next_pos;
|
||||
}
|
||||
// compute a new probe pos at the middle of the range and try again
|
||||
// we can't have a zero-size range here because we would not have set found_low yet
|
||||
THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
|
||||
const Pos delta = (upper_bound - lower_bound) / 2;
|
||||
probe_pos = lower_bound + delta;
|
||||
if (delta < 1) {
|
||||
// nothing can exist in the range (lower_bound, upper_bound)
|
||||
// and an object is known to exist at lower_bound
|
||||
SEEKER_DEBUG_LOG("return: probe_pos == lower_bound " << lower_bound);
|
||||
return lower_bound;
|
||||
}
|
||||
THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
|
||||
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
|
||||
SEEKER_DEBUG_LOG("loop bottom: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
|
||||
}
|
||||
THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
|
||||
"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
|
||||
"found_low " << found_low);
|
||||
}
|
||||
}
|
||||
|
||||
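The comments in `seek_backward` describe a two-phase search: widen the range below `target_pos` by doubling until some key at or below the target is found, then bisect between the bounds. The following is a simplified, self-contained sketch of that idea, not the crucible implementation. It assumes a hypothetical single-argument fetch that returns a capped number of keys in ascending order starting at a lower bound, and it returns `std::nullopt` where the original returns the beginning of the search space. The names `FetchSketch`, `make_limited_fetch`, and `seek_backward_sketch` are invented for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <optional>
#include <set>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical fetch contract: up to a fixed number of keys >= lower, ascending.
using FetchSketch = std::function<std::vector<uint64_t>(uint64_t lower)>;

// Illustrative fetch over an in-memory set, capped at `limit` keys per call.
FetchSketch make_limited_fetch(std::set<uint64_t> keys, size_t limit)
{
	return [keys = std::move(keys), limit](uint64_t lower) {
		std::vector<uint64_t> rv;
		for (auto it = keys.lower_bound(lower); it != keys.end() && rv.size() < limit; ++it) {
			rv.push_back(*it);
		}
		return rv;
	};
}

// Find the highest key <= target, or nullopt if there is none.
std::optional<uint64_t> seek_backward_sketch(uint64_t target, const FetchSketch &fetch, uint64_t min_step = 1)
{
	uint64_t lower = 0;       // once found_low is set, a key exists at this position
	uint64_t upper = target;  // after the first probe, keys in [upper, target] are ruled out
	uint64_t probe = target;
	bool found_low = false;
	constexpr size_t max_loops = 4 * 64;  // generous bound for a 64-bit search space
	for (size_t loop = 0; loop < max_loops; ++loop) {
		const std::vector<uint64_t> result = fetch(probe);
		const bool all_too_high = result.empty() || result.front() > target;
		if (!found_low) {
			if (all_too_high) {
				// Phase 1: nothing at or below target yet, so double the distance
				// between probe and target, without stepping below zero.
				if (probe == 0) return std::nullopt;
				const uint64_t want = std::max(std::max(upper - probe, min_step), uint64_t(1));
				probe -= std::min(want, probe);
				continue;
			}
			found_low = true;
			lower = result.front();
		}
		if (all_too_high) {
			// No key in [probe, target]: lower the upper bound.
			upper = probe;
		} else if (result.back() >= target) {
			// Result straddles target: the answer is the last key <= target.
			return *std::prev(std::upper_bound(result.begin(), result.end(), target));
		} else {
			// All keys are below target: raise the lower bound, but not above upper.
			lower = std::min(result.back(), upper);
		}
		// Phase 2: bisect between the bounds.
		const uint64_t delta = (upper - lower) / 2;
		if (delta < 1) return lower;
		probe = lower + delta;
	}
	throw std::runtime_error("seek_backward_sketch: loop limit reached");
}
```

With keys `{10, 20, 30, 1000}` and a two-key fetch limit, seeking backward from 25 yields 20 and from 999 yields 30, while seeking from 5 yields no result because no key lies at or below it.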
|
@@ -47,6 +47,10 @@ namespace crucible {
|
||||
/// been destroyed.
|
||||
void append(const Task &task) const;
|
||||
|
||||
/// Schedule Task to run after this Task has run or
|
||||
/// been destroyed, in Task ID order.
|
||||
void insert(const Task &task) const;
|
||||
|
||||
/// Describe Task as text.
|
||||
string title() const;
|
||||
|
||||
@@ -172,9 +176,6 @@ namespace crucible {
|
||||
/// objects it holds, and exit its Task function.
|
||||
ExclusionLock try_lock(const Task &task);
|
||||
|
||||
/// Execute Task when Exclusion is unlocked (possibly
|
||||
/// immediately).
|
||||
void insert_task(const Task &t);
|
||||
};
|
||||
|
||||
/// Wrapper around pthread_setname_np which handles length limits
|
||||
|
@@ -14,8 +14,10 @@ CRUCIBLE_OBJS = \
|
||||
fs.o \
|
||||
multilock.o \
|
||||
ntoa.o \
|
||||
openat2.o \
|
||||
path.o \
|
||||
process.o \
|
||||
seeker.o \
|
||||
string.o \
|
||||
table.o \
|
||||
task.o \
|
||||
|
@@ -5,6 +5,12 @@
|
||||
#include "crucible/hexdump.h"
|
||||
#include "crucible/seeker.h"
|
||||
|
||||
#define CRUCIBLE_BTRFS_TREE_DEBUG(x) do { \
|
||||
if (BtrfsIoctlSearchKey::s_debug_ostream) { \
|
||||
(*BtrfsIoctlSearchKey::s_debug_ostream) << x; \
|
||||
} \
|
||||
} while (false)
|
||||
|
||||
namespace crucible {
|
||||
using namespace std;
|
||||
|
||||
@@ -355,6 +361,7 @@ namespace crucible {
|
||||
BtrfsTreeItem
|
||||
BtrfsTreeFetcher::at(uint64_t logical)
|
||||
{
|
||||
CRUCIBLE_BTRFS_TREE_DEBUG("at " << logical);
|
||||
BtrfsIoctlSearchKey &sk = m_sk;
|
||||
fill_sk(sk, logical);
|
||||
// Exact match, should return 0 or 1 items
|
||||
@@ -397,53 +404,59 @@ namespace crucible {
|
||||
BtrfsTreeFetcher::rlower_bound(uint64_t logical)
|
||||
{
|
||||
#if 0
|
||||
#define BTFRLB_DEBUG(x) do { cerr << x; } while (false)
|
||||
static bool btfrlb_debug = getenv("BTFLRB_DEBUG");
|
||||
#define BTFRLB_DEBUG(x) do { if (btfrlb_debug) cerr << x; } while (false)
|
||||
#else
|
||||
#define BTFRLB_DEBUG(x) do { } while (false)
|
||||
#define BTFRLB_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
|
||||
#endif
|
||||
BtrfsTreeItem closest_item;
|
||||
uint64_t closest_logical = 0;
|
||||
BtrfsIoctlSearchKey &sk = m_sk;
|
||||
size_t loops = 0;
|
||||
BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << endl);
|
||||
seek_backward(scale_logical(logical), [&](uint64_t lower_bound, uint64_t upper_bound) {
|
||||
BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << " in tree " << tree() << endl);
|
||||
seek_backward(scale_logical(logical), [&](uint64_t const lower_bound, uint64_t const upper_bound) {
|
||||
++loops;
|
||||
fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
|
||||
set<uint64_t> rv;
|
||||
bool too_far = false;
|
||||
do {
|
||||
sk.nr_items = 4;
|
||||
sk.do_ioctl(fd());
|
||||
BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
|
||||
for (auto &i : sk.m_result) {
|
||||
next_sk(sk, i);
|
||||
const auto this_logical = hdr_logical(i);
|
||||
const auto scaled_hdr_logical = scale_logical(this_logical);
|
||||
BTFRLB_DEBUG(" " << to_hex(scaled_hdr_logical));
|
||||
if (hdr_match(i)) {
|
||||
if (this_logical <= logical && this_logical > closest_logical) {
|
||||
closest_logical = this_logical;
|
||||
closest_item = i;
|
||||
}
|
||||
BTFRLB_DEBUG("(match)");
|
||||
rv.insert(scaled_hdr_logical);
|
||||
}
|
||||
if (scaled_hdr_logical > upper_bound || hdr_stop(i)) {
|
||||
if (scaled_hdr_logical >= upper_bound) {
|
||||
BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
|
||||
}
|
||||
if (hdr_stop(i)) {
|
||||
rv.insert(numeric_limits<uint64_t>::max());
|
||||
BTFRLB_DEBUG("(stop)");
|
||||
}
|
||||
// If hdr_stop or !hdr_match, don't inspect the item
|
||||
if (hdr_stop(i)) {
|
||||
too_far = true;
|
||||
rv.insert(numeric_limits<uint64_t>::max());
|
||||
BTFRLB_DEBUG("(stop)");
|
||||
break;
|
||||
} else {
|
||||
BTFRLB_DEBUG("(cont'd)");
|
||||
}
|
||||
if (!hdr_match(i)) {
|
||||
BTFRLB_DEBUG("(no match)");
|
||||
continue;
|
||||
}
|
||||
const auto this_logical = hdr_logical(i);
|
||||
BTFRLB_DEBUG(" " << to_hex(this_logical) << " " << i);
|
||||
const auto scaled_hdr_logical = scale_logical(this_logical);
|
||||
BTFRLB_DEBUG(" " << "(match)");
|
||||
if (scaled_hdr_logical > upper_bound) {
|
||||
too_far = true;
|
||||
BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " >= " << to_hex(upper_bound) << ")");
|
||||
break;
|
||||
}
|
||||
if (this_logical <= logical && this_logical > closest_logical) {
|
||||
closest_logical = this_logical;
|
||||
closest_item = i;
|
||||
BTFRLB_DEBUG("(closest)");
|
||||
}
|
||||
rv.insert(scaled_hdr_logical);
|
||||
BTFRLB_DEBUG("(cont'd)");
|
||||
}
|
||||
BTFRLB_DEBUG(endl);
|
||||
// We might get a search result that contains only non-matching items.
|
||||
// Keep looping until we find any matching item or we run out of tree.
|
||||
} while (rv.empty() && !sk.m_result.empty());
|
||||
} while (!too_far && rv.empty() && !sk.m_result.empty());
|
||||
return rv;
|
||||
}, scale_logical(lookbehind_size()));
|
||||
return closest_item;
|
||||
@@ -474,6 +487,7 @@ namespace crucible {
|
||||
BtrfsTreeItem
|
||||
BtrfsTreeFetcher::next(uint64_t logical)
|
||||
{
|
||||
CRUCIBLE_BTRFS_TREE_DEBUG("next " << logical);
|
||||
const auto scaled_logical = scale_logical(logical);
|
||||
if (scaled_logical + 1 > scaled_max_logical()) {
|
||||
return BtrfsTreeItem();
|
||||
@@ -484,6 +498,7 @@ namespace crucible {
|
||||
BtrfsTreeItem
|
||||
BtrfsTreeFetcher::prev(uint64_t logical)
|
||||
{
|
||||
CRUCIBLE_BTRFS_TREE_DEBUG("prev " << logical);
|
||||
const auto scaled_logical = scale_logical(logical);
|
||||
if (scaled_logical < 1) {
|
||||
return BtrfsTreeItem();
|
||||
@@ -568,9 +583,10 @@ namespace crucible {
|
||||
BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
|
||||
{
|
||||
#if 0
|
||||
#define BCTFGS_DEBUG(x) do { cerr << x; } while (false)
|
||||
static bool bctfgs_debug = getenv("BCTFGS_DEBUG");
|
||||
#define BCTFGS_DEBUG(x) do { if (bctfgs_debug) cerr << x; } while (false)
|
||||
#else
|
||||
#define BCTFGS_DEBUG(x) do { } while (false)
|
||||
#define BCTFGS_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
|
||||
#endif
|
||||
const uint64_t logical_end = logical + count * block_size();
|
||||
BtrfsTreeItem bti = rlower_bound(logical);
|
||||
@@ -662,14 +678,6 @@ namespace crucible {
|
||||
type(BTRFS_EXTENT_DATA_KEY);
|
||||
}
|
||||
|
||||
BtrfsFsTreeFetcher::BtrfsFsTreeFetcher(const Fd &new_fd, uint64_t subvol) :
|
||||
BtrfsTreeObjectFetcher(new_fd)
|
||||
{
|
||||
tree(subvol);
|
||||
type(BTRFS_EXTENT_DATA_KEY);
|
||||
scale_size(1);
|
||||
}
|
||||
|
||||
BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
|
||||
BtrfsTreeObjectFetcher(fd)
|
||||
{
|
||||
@@ -693,18 +701,86 @@ namespace crucible {
|
||||
BtrfsTreeObjectFetcher(fd)
|
||||
{
|
||||
tree(BTRFS_ROOT_TREE_OBJECTID);
|
||||
type(BTRFS_ROOT_ITEM_KEY);
|
||||
scale_size(1);
|
||||
}
|
||||
|
||||
BtrfsTreeItem
|
||||
BtrfsRootFetcher::root(uint64_t subvol)
|
||||
BtrfsRootFetcher::root(const uint64_t subvol)
|
||||
{
|
||||
const auto my_type = BTRFS_ROOT_ITEM_KEY;
|
||||
type(my_type);
|
||||
const auto item = at(subvol);
|
||||
if (!!item) {
|
||||
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
|
||||
THROW_CHECK2(runtime_error, item.type(), BTRFS_ROOT_ITEM_KEY, item.type() == BTRFS_ROOT_ITEM_KEY);
|
||||
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
|
||||
}
|
||||
return item;
|
||||
}
|
||||
|
||||
BtrfsTreeItem
|
||||
BtrfsRootFetcher::root_backref(const uint64_t subvol)
|
||||
{
|
||||
const auto my_type = BTRFS_ROOT_BACKREF_KEY;
|
||||
type(my_type);
|
||||
const auto item = at(subvol);
|
||||
if (!!item) {
|
||||
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
|
||||
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
|
||||
}
|
||||
return item;
|
||||
}
|
||||
|
||||
BtrfsDataExtentTreeFetcher::BtrfsDataExtentTreeFetcher(const Fd &fd) :
|
||||
BtrfsExtentItemFetcher(fd),
|
||||
m_chunk_tree(fd)
|
||||
{
|
||||
tree(BTRFS_EXTENT_TREE_OBJECTID);
|
||||
type(BTRFS_EXTENT_ITEM_KEY);
|
||||
m_chunk_tree.tree(BTRFS_CHUNK_TREE_OBJECTID);
|
||||
m_chunk_tree.type(BTRFS_CHUNK_ITEM_KEY);
|
||||
m_chunk_tree.objectid(BTRFS_FIRST_CHUNK_TREE_OBJECTID);
|
||||
}
|
||||
|
||||
void
|
||||
BtrfsDataExtentTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
|
||||
{
|
||||
key.min_type = key.max_type = type();
|
||||
key.max_objectid = key.max_offset = numeric_limits<uint64_t>::max();
|
||||
key.min_offset = 0;
|
||||
key.min_objectid = hdr.objectid;
|
||||
const auto step = scale_size();
|
||||
if (key.min_objectid < numeric_limits<uint64_t>::max() - step) {
|
||||
key.min_objectid += step;
|
||||
} else {
|
||||
key.min_objectid = numeric_limits<uint64_t>::max();
|
||||
}
|
||||
// If we're still in our current block group, check here
|
||||
if (!!m_current_bg) {
|
||||
const auto bg_begin = m_current_bg.offset();
|
||||
const auto bg_end = bg_begin + m_current_bg.chunk_length();
|
||||
// If we are still in our current block group, return early
|
||||
if (key.min_objectid >= bg_begin && key.min_objectid < bg_end) return;
|
||||
}
|
||||
// We don't have a current block group or we're out of range
|
||||
// Find the chunk that this bytenr belongs to
|
||||
m_current_bg = m_chunk_tree.rlower_bound(key.min_objectid);
|
||||
// Make sure it's a data block group
|
||||
while (!!m_current_bg) {
|
||||
// Data block group, stop here
|
||||
if (m_current_bg.chunk_type() & BTRFS_BLOCK_GROUP_DATA) break;
|
||||
// Not a data block group, skip to end
|
||||
key.min_objectid = m_current_bg.offset() + m_current_bg.chunk_length();
|
||||
m_current_bg = m_chunk_tree.lower_bound(key.min_objectid);
|
||||
}
|
||||
if (!m_current_bg) {
|
||||
// Ran out of data block groups, stop here
|
||||
return;
|
||||
}
|
||||
// Check to see if bytenr is in the current data block group
|
||||
const auto bg_begin = m_current_bg.offset();
|
||||
if (key.min_objectid < bg_begin) {
|
||||
// Move forward to start of data block group
|
||||
key.min_objectid = bg_begin;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
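The `next_sk` hunk above advances `key.min_objectid` by `scale_size()` and clamps at the top of the 64-bit range rather than letting the addition wrap. A minimal sketch of that clamped step follows. The free function `saturating_step` is a hypothetical name used only for illustration and is not part of crucible.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of crucible): advance `pos` by `step`,
// clamping to UINT64_MAX instead of wrapping around.
// Uses the same comparison as the hunk: pos < max - step.
inline uint64_t saturating_step(uint64_t pos, uint64_t step)
{
	const uint64_t max = std::numeric_limits<uint64_t>::max();
	if (pos < max - step) {
		return pos + step;
	}
	return max;
}
```

Clamping keeps a search key that is already near the end of the address space from wrapping back to a small objectid, which would restart the search from the beginning of the tree.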
@@ -76,7 +76,7 @@ namespace crucible {
|
||||
DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));
|
||||
|
||||
header_stream << buf;
|
||||
header_stream << " " << getpid() << "." << crucible::gettid();
|
||||
header_stream << " " << getpid() << "." << gettid();
|
||||
if (add_prefix_level) {
|
||||
header_stream << "<" << m_loglevel << ">";
|
||||
}
|
||||
@@ -88,7 +88,7 @@ namespace crucible {
|
||||
header_stream << "<" << m_loglevel << ">";
|
||||
}
|
||||
header_stream << (m_name.empty() ? "thread" : m_name);
|
||||
header_stream << "[" << crucible::gettid() << "]";
|
||||
header_stream << "[" << gettid() << "]";
|
||||
}
|
||||
|
||||
header_stream << ": ";
|
||||
|
51
lib/fs.cc
@@ -757,6 +757,7 @@ namespace crucible {
|
||||
thread_local size_t BtrfsIoctlSearchKey::s_calls = 0;
|
||||
thread_local size_t BtrfsIoctlSearchKey::s_loops = 0;
|
||||
thread_local size_t BtrfsIoctlSearchKey::s_loops_empty = 0;
|
||||
thread_local shared_ptr<ostream> BtrfsIoctlSearchKey::s_debug_ostream;
|
||||
|
||||
bool
|
||||
BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
|
||||
@@ -776,6 +777,9 @@ namespace crucible {
|
||||
ioctl_ptr = ioctl_arg.get<btrfs_ioctl_search_args_v2>();
|
||||
ioctl_ptr->key = static_cast<const btrfs_ioctl_search_key&>(*this);
|
||||
ioctl_ptr->buf_size = buf_size;
|
||||
if (s_debug_ostream) {
|
||||
(*s_debug_ostream) << "bisk " << (ioctl_ptr->key) << "\n";
|
||||
}
|
||||
// Don't bother supporting V1. Kernels that old have other problems.
|
||||
int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_arg.data());
|
||||
++s_calls;
|
||||
@@ -881,6 +885,26 @@ namespace crucible {
|
||||
}
|
||||
}
|
||||
|
||||
string
|
||||
btrfs_chunk_type_ntoa(uint64_t type)
|
||||
{
|
||||
static const bits_ntoa_table table[] = {
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DATA),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_METADATA),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_SYSTEM),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DUP),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID0),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID10),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C3),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C4),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID5),
|
||||
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID6),
|
||||
NTOA_TABLE_ENTRY_END()
|
||||
};
|
||||
return bits_ntoa(type, table);
|
||||
}
|
||||
|
||||
string
|
||||
btrfs_search_type_ntoa(unsigned type)
|
||||
{
|
||||
@@ -908,15 +932,9 @@ namespace crucible {
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_BLOCK_REF_KEY),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_DATA_REF_KEY),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_BLOCK_GROUP_ITEM_KEY),
|
||||
#ifdef BTRFS_FREE_SPACE_INFO_KEY
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_INFO_KEY),
|
||||
#endif
|
||||
#ifdef BTRFS_FREE_SPACE_EXTENT_KEY
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_EXTENT_KEY),
|
||||
#endif
|
||||
#ifdef BTRFS_FREE_SPACE_BITMAP_KEY
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_BITMAP_KEY),
|
||||
#endif
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_EXTENT_KEY),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_ITEM_KEY),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_CHUNK_ITEM_KEY),
|
||||
@@ -948,9 +966,7 @@ namespace crucible {
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_CSUM_TREE_OBJECTID),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_QUOTA_TREE_OBJECTID),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_UUID_TREE_OBJECTID),
|
||||
#ifdef BTRFS_FREE_SPACE_TREE_OBJECTID
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_TREE_OBJECTID),
|
||||
#endif
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_BALANCE_OBJECTID),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_ORPHAN_OBJECTID),
|
||||
NTOA_TABLE_ENTRY_ENUM(BTRFS_TREE_LOG_OBJECTID),
|
||||
@@ -1138,11 +1154,17 @@ namespace crucible {
|
||||
{
|
||||
}
|
||||
|
||||
void
|
||||
BtrfsIoctlFsInfoArgs::do_ioctl(int fd)
|
||||
bool
|
||||
BtrfsIoctlFsInfoArgs::do_ioctl_nothrow(int const fd)
|
||||
{
|
||||
btrfs_ioctl_fs_info_args_v3 *p = static_cast<btrfs_ioctl_fs_info_args_v3 *>(this);
|
||||
if (ioctl(fd, BTRFS_IOC_FS_INFO, p)) {
|
||||
return 0 == ioctl(fd, BTRFS_IOC_FS_INFO, p);
|
||||
}
|
||||
|
||||
void
|
||||
BtrfsIoctlFsInfoArgs::do_ioctl(int const fd)
|
||||
{
|
||||
if (!do_ioctl_nothrow(fd)) {
|
||||
THROW_ERRNO("BTRFS_IOC_FS_INFO: fd " << fd);
|
||||
}
|
||||
}
|
||||
@@ -1159,6 +1181,13 @@ namespace crucible {
|
||||
return this->btrfs_ioctl_fs_info_args_v3::csum_size;
|
||||
}
|
||||
|
||||
vector<uint8_t>
|
||||
BtrfsIoctlFsInfoArgs::fsid() const
|
||||
{
|
||||
const auto begin = btrfs_ioctl_fs_info_args_v3::fsid;
|
||||
return vector<uint8_t>(begin, begin + BTRFS_FSID_SIZE);
|
||||
}
|
||||
|
||||
uint64_t
|
||||
BtrfsIoctlFsInfoArgs::generation() const
|
||||
{
|
||||
|
40
lib/openat2.cc
Normal file
@@ -0,0 +1,40 @@
|
||||
#include "crucible/openat2.h"
|
||||
|
||||
#include <sys/syscall.h>
|
||||
|
||||
// Compatibility for building on old libc for new kernel
|
||||
|
||||
#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 6, 0)
|
||||
|
||||
// Every arch that defines this uses 437, except Alpha, where 437 is
|
||||
// mq_getsetattr.
|
||||
|
||||
#ifndef SYS_openat2
|
||||
#ifdef __alpha__
|
||||
#define SYS_openat2 547
|
||||
#else
|
||||
#define SYS_openat2 437
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#endif // Linux version >= v5.6
|
||||
|
||||
#include <fcntl.h>
|
||||
#include <unistd.h>
|
||||
|
||||
extern "C" {
|
||||
|
||||
int
|
||||
__attribute__((weak))
|
||||
openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
|
||||
throw()
|
||||
{
|
||||
#ifdef SYS_openat2
|
||||
return syscall(SYS_openat2, dirfd, pathname, how, size);
|
||||
#else
|
||||
errno = ENOSYS;
|
||||
return -1;
|
||||
#endif
|
||||
}
|
||||
|
||||
};
|
@@ -7,13 +7,18 @@
|
||||
#include <cstdlib>
|
||||
#include <utility>
|
||||
|
||||
// for gettid()
|
||||
#ifndef _GNU_SOURCE
|
||||
#define _GNU_SOURCE
|
||||
#endif
|
||||
#include <unistd.h>
|
||||
#include <sys/syscall.h>
|
||||
|
||||
extern "C" {
|
||||
pid_t
|
||||
__attribute__((weak))
|
||||
gettid() throw()
|
||||
{
|
||||
return syscall(SYS_gettid);
|
||||
}
|
||||
};
|
||||
|
||||
namespace crucible {
|
||||
using namespace std;
|
||||
|
||||
@@ -111,12 +116,6 @@ namespace crucible {
|
||||
}
|
||||
}
|
||||
|
||||
pid_t
|
||||
gettid()
|
||||
{
|
||||
return syscall(SYS_gettid);
|
||||
}
|
||||
|
||||
double
|
||||
getloadavg1()
|
||||
{
|
||||
|
7
lib/seeker.cc
Normal file
@@ -0,0 +1,7 @@
|
||||
#include "crucible/seeker.h"
|
||||
|
||||
namespace crucible {
|
||||
|
||||
thread_local shared_ptr<ostream> tl_seeker_debug_str;
|
||||
|
||||
};
|
116
lib/task.cc
@@ -76,13 +76,24 @@ namespace crucible {
|
||||
/// Tasks to be executed after the current task is executed
|
||||
list<TaskStatePtr> m_post_exec_queue;
|
||||
|
||||
/// Set by run() and append(). Cleared by exec().
|
||||
/// Set by run(), append(), and insert(). Cleared by exec().
|
||||
bool m_run_now = false;
|
||||
|
||||
/// Set by insert(). Cleared by exec() and destructor.
|
||||
bool m_sort_queue = false;
|
||||
|
||||
/// Set when task starts execution by exec().
|
||||
/// Cleared when exec() ends.
|
||||
bool m_is_running = false;
|
||||
|
||||
/// Set when task is queued while already running.
|
||||
/// Cleared when task is requeued.
|
||||
bool m_run_again = false;
|
||||
|
||||
/// Set when task is queued as idle task while already running.
|
||||
/// Cleared when task is queued as non-idle task.
|
||||
bool m_idle = false;
|
||||
|
||||
/// Sequential identifier for next task
|
||||
static atomic<TaskId> s_next_id;
|
||||
|
||||
@@ -107,7 +118,7 @@ namespace crucible {
|
||||
static void clear_queue(TaskQueue &tq);
|
||||
|
||||
/// Rescue any TaskQueue, not just this one.
|
||||
static void rescue_queue(TaskQueue &tq);
|
||||
static void rescue_queue(TaskQueue &tq, const bool sort_queue);
|
||||
|
||||
TaskState &operator=(const TaskState &) = delete;
|
||||
TaskState(const TaskState &) = delete;
|
||||
@@ -142,6 +153,10 @@ namespace crucible {
|
||||
/// or is destroyed.
|
||||
void append(const TaskStatePtr &task);
|
||||
|
||||
/// Queue task to execute after current task finishes executing
|
||||
/// or is destroyed, in task ID order.
|
||||
void insert(const TaskStatePtr &task);
|
||||
|
||||
/// How many Tasks are there? Good for catching leaks
|
||||
static size_t instance_count();
|
||||
};
|
||||
@@ -219,16 +234,21 @@ namespace crucible {
|
||||
static auto s_tms = make_shared<TaskMasterState>();
|
||||
|
||||
void
|
||||
TaskState::rescue_queue(TaskQueue &queue)
|
||||
TaskState::rescue_queue(TaskQueue &queue, const bool sort_queue)
|
||||
{
|
||||
if (queue.empty()) {
|
||||
return;
|
||||
}
|
||||
const auto tlcc = tl_current_consumer;
|
||||
const auto &tlcc = tl_current_consumer;
|
||||
if (tlcc) {
|
||||
// We are executing under a TaskConsumer, splice our post-exec queue at front.
|
||||
// No locks needed because we are using only thread-local objects.
|
||||
tlcc->m_local_queue.splice(tlcc->m_local_queue.begin(), queue);
|
||||
if (sort_queue) {
|
||||
tlcc->m_local_queue.sort([&](const TaskStatePtr &a, const TaskStatePtr &b) {
|
||||
return a->m_id < b->m_id;
|
||||
});
|
||||
}
|
||||
} else {
|
||||
// We are not executing under a TaskConsumer.
|
||||
// If there is only one task, then just insert it at the front of the queue.
|
||||
@@ -239,6 +259,8 @@ namespace crucible {
|
||||
// then push it to the front of the global queue using normal locking methods.
|
||||
TaskStatePtr rescue_task(make_shared<TaskState>("rescue_task", [](){}));
|
||||
swap(rescue_task->m_post_exec_queue, queue);
|
||||
// Do the sort--once--when a new Consumer has picked up the Task
|
||||
rescue_task->m_sort_queue = sort_queue;
|
||||
TaskQueue tq_one { rescue_task };
|
||||
TaskMasterState::push_front(tq_one);
|
||||
}
|
||||
@@ -251,7 +273,8 @@ namespace crucible {
|
||||
--s_instance_count;
|
||||
unique_lock<mutex> lock(m_mutex);
|
||||
// If any dependent Tasks were appended since the last exec, run them now
|
||||
TaskState::rescue_queue(m_post_exec_queue);
|
||||
TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
|
||||
// No need to clear m_sort_queue here, it won't exist soon
|
||||
}
|
||||
|
||||
TaskState::TaskState(string title, function<void()> exec_fn) :
|
||||
@@ -310,6 +333,24 @@ namespace crucible {
|
||||
task->m_run_now = true;
|
||||
append_nolock(task);
|
||||
}
|
||||
task->m_idle = false;
|
||||
}
|
||||
|
||||
void
|
||||
TaskState::insert(const TaskStatePtr &task)
|
||||
{
|
||||
THROW_CHECK0(invalid_argument, task);
|
||||
THROW_CHECK2(invalid_argument, m_id, task->m_id, m_id != task->m_id);
|
||||
PairLock lock(m_mutex, task->m_mutex);
|
||||
if (!task->m_run_now) {
|
||||
task->m_run_now = true;
|
||||
// Move the task and its post-exec queue to follow this task,
|
||||
// and request a sort of the flattened list.
|
||||
m_sort_queue = true;
|
||||
m_post_exec_queue.push_back(task);
|
||||
m_post_exec_queue.splice(m_post_exec_queue.end(), task->m_post_exec_queue);
|
||||
}
|
||||
task->m_idle = false;
|
||||
}
|
||||
|
||||
void
|
||||
@@ -320,7 +361,7 @@ namespace crucible {
|
||||
|
||||
unique_lock<mutex> lock(m_mutex);
|
||||
if (m_is_running) {
|
||||
append_nolock(shared_from_this());
|
||||
m_run_again = true;
|
||||
return;
|
||||
} else {
|
||||
m_run_now = false;
|
||||
@@ -344,8 +385,20 @@ namespace crucible {
|
||||
swap(this_task, tl_current_task);
|
||||
m_is_running = false;
|
||||
|
||||
if (m_run_again) {
|
||||
m_run_again = false;
|
||||
if (m_idle) {
|
||||
// All the way back to the end of the line
|
||||
TaskMasterState::push_back_idle(shared_from_this());
|
||||
} else {
|
||||
// Insert after any dependents waiting for this Task
|
||||
m_post_exec_queue.push_back(shared_from_this());
|
||||
}
|
||||
}
|
||||
|
||||
// Splice task post_exec queue at front of local queue
|
||||
TaskState::rescue_queue(m_post_exec_queue);
|
||||
TaskState::rescue_queue(m_post_exec_queue, m_sort_queue);
|
||||
m_sort_queue = false;
|
||||
}
|
||||
|
||||
string
|
||||
@@ -365,22 +418,32 @@ namespace crucible {
|
||||
TaskState::run()
|
||||
{
|
||||
unique_lock<mutex> lock(m_mutex);
|
||||
m_idle = false;
|
||||
if (m_run_now) {
|
||||
return;
|
||||
}
|
||||
m_run_now = true;
|
||||
TaskMasterState::push_back(shared_from_this());
|
||||
if (m_is_running) {
|
||||
m_run_again = true;
|
||||
} else {
|
||||
TaskMasterState::push_back(shared_from_this());
|
||||
}
|
||||
}
|
||||
|
||||
void
|
||||
TaskState::idle()
|
||||
{
|
||||
unique_lock<mutex> lock(m_mutex);
|
||||
m_idle = true;
|
||||
if (m_run_now) {
|
||||
return;
|
||||
}
|
||||
m_run_now = true;
|
||||
TaskMasterState::push_back_idle(shared_from_this());
|
||||
if (m_is_running) {
|
||||
m_run_again = true;
|
||||
} else {
|
||||
TaskMasterState::push_back_idle(shared_from_this());
|
||||
}
|
||||
}
|
||||
|
||||
TaskMasterState::TaskMasterState(size_t thread_max) :
|
||||
@@ -740,6 +803,14 @@ namespace crucible {
|
||||
m_task_state->append(that.m_task_state);
|
||||
}
|
||||
|
||||
void
|
||||
Task::insert(const Task &that) const
|
||||
{
|
||||
THROW_CHECK0(runtime_error, m_task_state);
|
||||
THROW_CHECK0(runtime_error, that);
|
||||
m_task_state->insert(that.m_task_state);
|
||||
}
|
||||
|
||||
Task
|
||||
Task::current_task()
|
||||
{
|
||||
@@ -854,11 +925,13 @@ namespace crucible {
|
||||
swap(this_consumer, tl_current_consumer);
|
||||
assert(!tl_current_consumer);
|
||||
|
||||
// Release lock to rescue queue (may attempt to queue a new task at TaskMaster).
|
||||
// rescue_queue normally sends tasks to the local queue of the current TaskConsumer thread,
|
||||
// but we just disconnected ourselves from that.
|
||||
// Release lock to rescue queue (may attempt to queue a
|
||||
// new task at TaskMaster). rescue_queue normally sends
|
||||
// tasks to the local queue of the current TaskConsumer
|
||||
// thread, but we just disconnected ourselves from that.
|
||||
// No sorting here because this is not a TaskState.
|
||||
lock.unlock();
|
||||
TaskState::rescue_queue(m_local_queue);
|
||||
TaskState::rescue_queue(m_local_queue, false);
|
||||
|
||||
// Hold lock so we can erase ourselves
|
||||
lock.lock();
|
||||
@@ -936,21 +1009,6 @@ namespace crucible {
|
||||
m_owner.reset();
|
||||
}
|
||||
|
||||
void
|
||||
Exclusion::insert_task(const Task &task)
|
||||
{
|
||||
unique_lock<mutex> lock(m_mutex);
|
||||
const auto sp = m_owner.lock();
|
||||
lock.unlock();
|
||||
if (sp) {
|
||||
// If Exclusion is locked then queue task for release;
|
||||
sp->append(task);
|
||||
} else {
|
||||
// otherwise, run the inserted task immediately
|
||||
task.run();
|
||||
}
|
||||
}
|
||||
|
||||
ExclusionLock
|
||||
Exclusion::try_lock(const Task &task)
|
||||
{
|
||||
@@ -958,7 +1016,7 @@ namespace crucible {
|
||||
const auto sp = m_owner.lock();
|
||||
if (sp) {
|
||||
if (task) {
|
||||
sp->append(task);
|
||||
sp->insert(task);
|
||||
}
|
||||
return ExclusionLock();
|
||||
} else {
|
||||
|
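The `rescue_queue` change in lib/task.cc splices a post-exec queue onto the front of the consumer's local queue and, when `sort_queue` is set, sorts that local queue by task ID. A minimal sketch of the same splice-then-sort step is shown below. `DemoTask`, `DemoQueue`, and `demo_rescue` are hypothetical stand-ins, and the sketch omits the locking and the non-consumer path that the real code handles.

```cpp
#include <cstdint>
#include <list>
#include <memory>

// Hypothetical stand-in for TaskState; only the ID matters here.
struct DemoTask {
	uint64_t id;
};
using DemoTaskPtr = std::shared_ptr<DemoTask>;
using DemoQueue = std::list<DemoTaskPtr>;

// Move everything from `pending` to the front of `local`, then
// optionally sort the whole local queue by ascending ID.
inline void demo_rescue(DemoQueue &local, DemoQueue &pending, bool sort_queue)
{
	if (pending.empty()) {
		return;
	}
	// splice() relinks list nodes; no elements are copied
	local.splice(local.begin(), pending);
	if (sort_queue) {
		// std::list::sort is stable, so equal IDs keep their order
		local.sort([](const DemoTaskPtr &a, const DemoTaskPtr &b) {
			return a->id < b->id;
		});
	}
}
```

With `sort_queue` false the spliced tasks simply run before whatever was already queued; with it true, the lowest-numbered task runs first regardless of arrival order.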
@@ -1,5 +1,13 @@
|
||||
#!/bin/bash
|
||||
|
||||
# if not called from systemd try to replicate mount unsharing on ctrl+c
|
||||
# see: https://github.com/Zygo/bees/issues/281
|
||||
if [ -z "${SYSTEMD_EXEC_PID}" -a -z "${UNSHARE_DONE}" ]; then
|
||||
UNSHARE_DONE=true
|
||||
export UNSHARE_DONE
|
||||
exec unshare -m --propagation private -- "$0" "$@"
|
||||
fi
|
||||
|
||||
## Helpful functions
|
||||
INFO(){ echo "INFO:" "$@"; }
|
||||
ERRO(){ echo "ERROR:" "$@"; exit 1; }
|
||||
@@ -108,13 +116,11 @@ mkdir -p "$WORK_DIR" || exit 1
|
||||
INFO "MOUNT DIR: $MNT_DIR"
|
||||
mkdir -p "$MNT_DIR" || exit 1
|
||||
|
||||
mount --make-private -osubvolid=5 /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
|
||||
mount --make-private -osubvolid=5,nodev,noexec /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
|
||||
|
||||
if [ ! -d "$BEESHOME" ]; then
|
||||
INFO "Create subvol $BEESHOME for store bees data"
|
||||
btrfs sub cre "$BEESHOME"
|
||||
else
|
||||
btrfs sub show "$BEESHOME" &> /dev/null || ERRO "$BEESHOME MUST BE A SUBVOL!"
|
||||
fi
|
||||
|
||||
# Check DB size
|
||||
|
@@ -17,6 +17,7 @@ KillSignal=SIGTERM
|
||||
MemoryAccounting=true
|
||||
Nice=19
|
||||
Restart=on-abnormal
|
||||
RuntimeDirectoryMode=0700
|
||||
RuntimeDirectory=bees
|
||||
StartupCPUWeight=25
|
||||
StartupIOWeight=25
|
||||
|
@@ -230,8 +230,10 @@ BeesContext::dedup(const BeesRangePair &brp_in)
|
||||
BeesAddress first_addr(brp.first.fd(), brp.first.begin());
|
||||
BeesAddress second_addr(brp.second.fd(), brp.second.begin());
|
||||
|
||||
if (first_addr.get_physical_or_zero() == second_addr.get_physical_or_zero()) {
|
||||
BEESLOGTRACE("equal physical addresses in dedup");
|
||||
const auto first_gpoz = first_addr.get_physical_or_zero();
|
||||
const auto second_gpoz = second_addr.get_physical_or_zero();
|
||||
if (first_gpoz == second_gpoz) {
|
||||
BEESLOGDEBUG("equal physical addresses " << first_addr << " and " << second_addr << " in dedup");
|
||||
BEESCOUNT(bug_dedup_same_physical);
|
||||
}
|
||||
|
||||
@@ -259,7 +261,7 @@ BeesContext::dedup(const BeesRangePair &brp_in)
|
||||
BEESCOUNTADD(dedup_bytes, brp.first.size());
|
||||
} else {
|
||||
BEESCOUNT(dedup_miss);
|
||||
BEESLOGWARN("NO Dedup! " << brp);
|
||||
BEESLOGINFO("NO Dedup! " << brp);
|
||||
}
|
||||
|
||||
lock.reset();
|
||||
@@ -373,7 +375,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
|
||||
Extent::OBSCURED | Extent::PREALLOC
|
||||
)) {
|
||||
BEESCOUNT(scan_interesting);
|
||||
BEESLOGWARN("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
|
||||
BEESLOGINFO("Interesting extent flags " << e << " from fd " << name_fd(bfr.fd()));
|
||||
}
|
||||
|
||||
if (e.flags() & Extent::HOLE) {
|
||||
@@ -385,7 +387,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
|
||||
if (e.flags() & Extent::PREALLOC) {
|
||||
// Prealloc is all zero and we replace it with a hole.
|
||||
// No special handling is required here. Nuke it and move on.
|
||||
BEESLOGINFO("prealloc extent " << e);
|
||||
BEESLOGINFO("prealloc extent " << e << " in " << bfr);
|
||||
// Must not extend past EOF
|
||||
auto extent_size = min(e.end(), bfr.file_size()) - e.begin();
|
||||
// Must hold tmpfile until dedupe is done
|
||||
@@ -534,7 +536,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
|
||||
|
||||
// Hash is toxic
|
||||
if (found_addr.is_toxic()) {
|
||||
BEESLOGWARN("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
|
||||
BEESLOGDEBUG("WORKAROUND: abandoned toxic match for hash " << hash << " addr " << found_addr << " matching bbd " << bbd);
|
||||
// Don't push these back in because we'll never delete them.
|
||||
// Extents may become non-toxic so give them a chance to expire.
|
||||
// hash_table->push_front_hash_addr(hash, found_addr);
|
||||
@@ -556,7 +558,7 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
|
||||
BeesResolver resolved(m_ctx, found_addr);
|
||||
// Toxic extents are really toxic
|
||||
if (resolved.is_toxic()) {
|
||||
BEESLOGWARN("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
|
||||
BEESLOGDEBUG("WORKAROUND: discovered toxic match at found_addr " << found_addr << " matching bbd " << bbd);
|
||||
BEESCOUNT(scan_toxic_match);
|
||||
// Make sure we never see this hash again.
|
||||
// It has become toxic since it was inserted into the hash table.
|
||||
@@ -917,7 +919,7 @@ BeesContext::scan_forward(const BeesFileRange &bfr_in)
|
||||
|
||||
// Sanity check
|
||||
if (bfr.begin() >= bfr.file_size()) {
|
||||
BEESLOGWARN("past EOF: " << bfr);
|
||||
BEESLOGDEBUG("past EOF: " << bfr);
|
||||
BEESCOUNT(scanf_eof);
|
||||
return false;
|
||||
}
|
||||
|
@@ -797,7 +797,7 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
|
||||
for (auto fp = madv_flags; fp->value; ++fp) {
|
||||
BEESTOOLONG("madvise(" << fp->name << ")");
|
||||
if (madvise(m_byte_ptr, m_size, fp->value)) {
|
||||
BEESLOGWARN("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
|
||||
BEESLOGNOTICE("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
|
||||
}
|
||||
}
|
||||
|
||||
@@ -811,8 +811,19 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t
|
||||
prefetch_loop();
|
||||
});
|
||||
|
||||
// Blacklist might fail if the hash table is not stored on a btrfs
|
||||
// Blacklist might fail if the hash table is not stored on a btrfs,
|
||||
// or if it's on a _different_ btrfs
|
||||
catch_all([&]() {
|
||||
// Root is definitely a btrfs
|
||||
BtrfsIoctlFsInfoArgs root_info;
|
||||
root_info.do_ioctl(m_ctx->root_fd());
|
||||
// Hash might not be a btrfs
|
||||
BtrfsIoctlFsInfoArgs hash_info;
|
||||
// If btrfs fs_info ioctl fails, it must be a different fs
|
||||
if (!hash_info.do_ioctl_nothrow(m_fd)) return;
|
||||
// If Hash is a btrfs, Root must be the same one
|
||||
if (root_info.fsid() != hash_info.fsid()) return;
|
||||
// Hash is on the same one, blacklist it
|
||||
m_ctx->blacklist_insert(BeesFileId(m_fd));
|
||||
});
|
||||
}
|
||||
|
File diff suppressed because it is too large
@@ -8,38 +8,32 @@ thread_local BeesTracer *BeesTracer::tl_next_tracer = nullptr;
|
||||
thread_local bool BeesTracer::tl_first = true;
|
||||
thread_local bool BeesTracer::tl_silent = false;
|
||||
|
||||
bool
|
||||
exception_check()
|
||||
{
|
||||
#if __cplusplus >= 201703
|
||||
static
|
||||
bool
|
||||
exception_check()
|
||||
{
|
||||
return uncaught_exceptions();
|
||||
}
|
||||
#else
|
||||
static
|
||||
bool
|
||||
exception_check()
|
||||
{
|
||||
return uncaught_exception();
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
BeesTracer::~BeesTracer()
|
||||
{
|
||||
if (!tl_silent && exception_check()) {
|
||||
if (tl_first) {
|
||||
BEESLOGNOTICE("--- BEGIN TRACE --- exception ---");
|
||||
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE --- exception ---");
|
||||
tl_first = false;
|
||||
}
|
||||
try {
|
||||
m_func();
|
||||
} catch (exception &e) {
|
||||
BEESLOGNOTICE("Nested exception: " << e.what());
|
||||
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception: " << e.what());
|
||||
} catch (...) {
|
||||
BEESLOGNOTICE("Nested exception ...");
|
||||
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception ...");
|
||||
}
|
||||
if (!m_next_tracer) {
|
||||
BEESLOGNOTICE("--- END TRACE --- exception ---");
|
||||
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE --- exception ---");
|
||||
}
|
||||
}
|
||||
tl_next_tracer = m_next_tracer;
|
||||
@@ -49,7 +43,7 @@ BeesTracer::~BeesTracer()
|
||||
}
|
||||
}
|
||||
|
||||
BeesTracer::BeesTracer(function<void()> f, bool silent) :
|
||||
BeesTracer::BeesTracer(const function<void()> &f, bool silent) :
|
||||
m_func(f)
|
||||
{
|
||||
m_next_tracer = tl_next_tracer;
|
||||
@@ -61,12 +55,12 @@ void
|
||||
BeesTracer::trace_now()
|
||||
{
|
||||
BeesTracer *tp = tl_next_tracer;
|
||||
BEESLOGNOTICE("--- BEGIN TRACE ---");
|
||||
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE ---");
|
||||
while (tp) {
|
||||
tp->m_func();
|
||||
tp = tp->m_next_tracer;
|
||||
}
|
||||
BEESLOGNOTICE("--- END TRACE ---");
|
||||
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE ---");
|
||||
}
|
||||
|
||||
bool
|
||||
@@ -91,9 +85,9 @@ BeesNote::~BeesNote()
|
||||
tl_next = m_prev;
|
||||
unique_lock<mutex> lock(s_mutex);
|
||||
if (tl_next) {
|
||||
s_status[crucible::gettid()] = tl_next;
|
||||
s_status[gettid()] = tl_next;
|
||||
} else {
|
||||
s_status.erase(crucible::gettid());
|
||||
s_status.erase(gettid());
|
||||
}
|
||||
}
|
||||
|
||||
@@ -104,7 +98,7 @@ BeesNote::BeesNote(function<void(ostream &os)> f) :
|
||||
m_prev = tl_next;
|
||||
tl_next = this;
|
||||
unique_lock<mutex> lock(s_mutex);
|
||||
s_status[crucible::gettid()] = tl_next;
|
||||
s_status[gettid()] = tl_next;
|
||||
}
|
||||
|
||||
void
|
||||
|
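The `BeesTracer` destructor in the hunks above only emits its trace when an exception is in flight, which it detects through `exception_check()`. A minimal sketch of that pattern follows, assuming a C++17 compiler so that `std::uncaught_exceptions()` is available. `DemoTracer` and `demo_count_traces` are hypothetical names, and the sketch leaves out the thread-local tracer chain and the logging macros.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for BeesTracer: runs the stored function from the
// destructor only if the stack is being unwound by an exception.
class DemoTracer {
	std::function<void()> m_func;
public:
	explicit DemoTracer(const std::function<void()> &f) : m_func(f) {}
	~DemoTracer()
	{
		if (std::uncaught_exceptions() > 0) {
			try {
				m_func();
			} catch (...) {
				// Never let an exception escape a destructor
			}
		}
	}
};

// Counts how often the tracer fires across one normal scope
// and one scope that exits by throwing.
inline int demo_count_traces()
{
	int fired = 0;
	{
		DemoTracer t([&]() { ++fired; });
	}
	try {
		DemoTracer t([&]() { ++fired; });
		throw std::runtime_error("demo");
	} catch (const std::exception &) {
	}
	return fired;
}
```

Only the scope that exits by throwing triggers the callback, so a tracer costs nothing on the normal path beyond storing the function object.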
@@ -457,7 +457,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
 			BEESCOUNT(pairbackward_toxic_hash);
 			break;
 		}
@@ -558,7 +558,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 			}
 		}
 		if (found_toxic) {
-			BEESLOGWARN("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
+			BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
 			BEESCOUNT(pairforward_toxic_hash);
 			break;
 		}
@@ -572,7 +572,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
 	}
 
 	if (first.overlaps(second)) {
-		BEESLOGTRACE("after grow, first " << first << "\n\toverlaps " << second);
+		BEESLOGDEBUG("after grow, first " << first << "\n\toverlaps " << second);
 		BEESCOUNT(bug_grow_pair_overlaps);
 	}
 
@@ -674,7 +674,7 @@ BeesAddress::magic_check(uint64_t flags)
 	static const unsigned recognized_flags = compressed_flags | delalloc_flags | ignore_flags | unusable_flags;
 
 	if (flags & ~recognized_flags) {
-		BEESLOGTRACE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
+		BEESLOGNOTICE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
 		m_addr = UNUSABLE;
 		// maybe we throw here?
 		BEESCOUNT(addr_unrecognized);
152
src/bees.cc
@@ -4,6 +4,7 @@
 #include "crucible/process.h"
 #include "crucible/string.h"
 #include "crucible/task.h"
+#include "crucible/uname.h"
 
 #include <cctype>
 #include <cmath>
@@ -11,17 +12,19 @@
|
||||
|
||||
#include <iostream>
|
||||
#include <memory>
|
||||
#include <regex>
|
||||
#include <sstream>
|
||||
|
||||
// PRIx64
|
||||
#include <inttypes.h>
|
||||
|
||||
#include <sched.h>
|
||||
#include <sys/fanotify.h>
|
||||
|
||||
#include <linux/fs.h>
|
||||
#include <sys/ioctl.h>
|
||||
|
||||
// statfs
|
||||
#include <linux/magic.h>
|
||||
#include <sys/statfs.h>
|
||||
|
||||
// setrlimit
|
||||
#include <sys/time.h>
|
||||
#include <sys/resource.h>
|
||||
@@ -198,7 +201,7 @@ BeesTooLong::check() const
 	if (age() > m_limit) {
 		ostringstream oss;
 		m_func(oss);
-		BEESLOGWARN("PERFORMANCE: " << *this << " sec: " << oss.str());
+		BEESLOGINFO("PERFORMANCE: " << *this << " sec: " << oss.str());
 	}
 }
 
@@ -246,10 +249,6 @@ bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
 	Timer readahead_timer;
 	BEESNOTE("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
 	BEESTOOLONG("readahead " << name_fd(fd) << " offset " << to_hex(offset) << " len " << pretty(size));
-#if 0
-	// In the kernel, readahead() is identical to posix_fadvise(..., POSIX_FADV_DONTNEED)
-	DIE_IF_NON_ZERO(readahead(fd, offset, size));
-#else
 	// Make sure this data is in page cache by brute force
 	// The btrfs kernel code does readahead with lower ioprio
 	// and might discard the readahead request entirely.
@@ -263,13 +262,16 @@ bees_readahead_nolock(int const fd, const off_t offset, const size_t size)
 		// Ignore errors and short reads. It turns out our size
 		// parameter isn't all that accurate, so we can't use
 		// the pread_or_die template.
-		(void)!pread(fd, dummy, this_read_size, working_offset);
-		BEESCOUNT(readahead_count);
-		BEESCOUNTADD(readahead_bytes, this_read_size);
+		const auto pr_rv = pread(fd, dummy, this_read_size, working_offset);
+		if (pr_rv >= 0) {
+			BEESCOUNT(readahead_count);
+			BEESCOUNTADD(readahead_bytes, pr_rv);
+		} else {
+			BEESCOUNT(readahead_fail);
+		}
 		working_offset += this_read_size;
 		working_size -= this_read_size;
 	}
-#endif
 	BEESCOUNTADD(readahead_ms, readahead_timer.age() * 1000);
 }
 
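The readahead hunk above stops discarding the `pread()` result: bytes are counted only when the call succeeds, using the byte count actually returned, and failures get their own counter. A minimal standalone sketch of that accounting follows. `ReadaheadStats`, `read_through`, and the 128 KiB chunk size are hypothetical names and values chosen for illustration; they are not taken from bees.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for the BEESCOUNT / BEESCOUNTADD counters.
struct ReadaheadStats {
	size_t count = 0;	// pread() calls that succeeded
	size_t bytes = 0;	// bytes actually returned by pread()
	size_t fail = 0;	// pread() calls that returned -1
};

// Read [offset, offset + size) in fixed-size chunks and throw the data away.
// Errors and short reads are tolerated; they only change the counters.
static ReadaheadStats
read_through(int fd, off_t offset, size_t size)
{
	const size_t chunk = 128 * 1024;	// assumed chunk size
	std::vector<char> dummy(chunk);
	ReadaheadStats stats;
	while (size > 0) {
		const size_t this_read_size = std::min(size, chunk);
		const ssize_t rv = pread(fd, dummy.data(), this_read_size, offset);
		if (rv >= 0) {
			++stats.count;
			stats.bytes += static_cast<size_t>(rv);
		} else {
			++stats.fail;
		}
		// Advance by the requested size, not the returned size,
		// so a failing region cannot stall the loop.
		offset += static_cast<off_t>(this_read_size);
		size -= this_read_size;
	}
	return stats;
}
```

With an invalid descriptor every chunk fails, so `read_through(-1, 0, 300000)` reports three failures and zero bytes rather than claiming 300000 bytes were read, which is the behavior the old unconditional `BEESCOUNTADD(readahead_bytes, this_read_size)` would have had.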
@@ -392,6 +394,73 @@ BeesStringFile::read()
 	return read_string(fd, st.st_size);
 }
 
+static
+void
+bees_fsync(int const fd)
+{
+
+	// Note that when btrfs renames a temporary over an existing file,
+	// it flushes the temporary, so we get the right behavior if we
+	// just do nothing here (except when the file is first created;
+	// however, in that case the result is the same as if the file
+	// did not exist, was empty, or was filled with garbage).
+	//
+	// Kernel versions prior to 5.16 had bugs which would put ghost
+	// dirents in $BEESHOME if there was a crash when we called
+	// fsync() here.
+	//
+	// Some other filesystems will throw our data away if we don't
+	// call fsync, so we do need to call fsync() on those filesystems.
+	//
+	// Newer btrfs kernel versions rely on fsync() to report
+	// unrecoverable write errors. If we don't check the fsync()
+	// result, we'll lose the data when we rename(). Kernel 6.2 added
+	// a number of new root causes for the class of "unrecoverable
+	// write errors" so we need to check this now.
+
+	BEESNOTE("checking filesystem type for " << name_fd(fd));
+	// LSB deprecated statfs without providing a replacement that
+	// can fill in the f_type field.
+	struct statfs stf = { 0 };
+	DIE_IF_NON_ZERO(fstatfs(fd, &stf));
+	if (static_cast<decltype(BTRFS_SUPER_MAGIC)>(stf.f_type) != BTRFS_SUPER_MAGIC) {
+		BEESLOGONCE("Using fsync on non-btrfs filesystem type " << to_hex(stf.f_type));
+		BEESNOTE("fsync non-btrfs " << name_fd(fd));
+		DIE_IF_NON_ZERO(fsync(fd));
+		return;
+	}
+
+	static bool did_uname = false;
+	static bool do_fsync = false;
+
+	if (!did_uname) {
+		Uname uname;
+		const string version(uname.release);
+		static const regex version_re(R"/(^(\d+)\.(\d+)\.)/", regex::optimize | regex::ECMAScript);
+		smatch m;
+		// Last known bug in the fsync-rename use case was fixed in kernel 5.16
+		static const auto min_major = 5, min_minor = 16;
+		if (regex_search(version, m, version_re)) {
+			const auto major = stoul(m[1]);
+			const auto minor = stoul(m[2]);
+			if (tie(major, minor) > tie(min_major, min_minor)) {
+				BEESLOGONCE("Using fsync on btrfs because kernel version is " << major << "." << minor);
+				do_fsync = true;
+			} else {
+				BEESLOGONCE("Not using fsync on btrfs because kernel version is " << major << "." << minor);
+			}
+		} else {
+			BEESLOGONCE("Not using fsync on btrfs because can't parse kernel version '" << version << "'");
+		}
+		did_uname = true;
+	}
+
+	if (do_fsync) {
+		BEESNOTE("fsync btrfs " << name_fd(fd));
+		DIE_IF_NON_ZERO(fsync(fd));
+	}
+}
+
 void
 BeesStringFile::write(string contents)
 {
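The kernel-version gate inside `bees_fsync` can be exercised on its own. The sketch below isolates the parse-and-compare step, assuming the input is a release string of the kind `uname(2)` reports. `kernel_newer_than` is a hypothetical helper name, not a function in bees, and the sketch compares value tuples built with `std::make_tuple` where the diff uses `tie`.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <tuple>

// Hypothetical helper: true when `release` starts with "major.minor." and
// that pair is strictly greater than (min_major, min_minor).
// A string that does not parse returns false, matching the
// "can't parse kernel version" branch in the diff, which does not fsync.
static bool
kernel_newer_than(const std::string &release, unsigned long min_major, unsigned long min_minor)
{
	static const std::regex version_re(R"(^(\d+)\.(\d+)\.)", std::regex::optimize | std::regex::ECMAScript);
	std::smatch m;
	if (!std::regex_search(release, m, version_re)) {
		return false;
	}
	const unsigned long major = std::stoul(m[1].str());
	const unsigned long minor = std::stoul(m[2].str());
	// Lexicographic comparison: major first, then minor.
	return std::make_tuple(major, minor) > std::make_tuple(min_major, min_minor);
}
```

Two details carry over from the diff. The comparison is strict, so a `5.16.x` release takes the no-fsync branch and `5.17.x` is the first to enable it. The regular expression also requires a dot after the minor number, so a bare `6.1` does not parse.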
@@ -407,19 +476,8 @@ BeesStringFile::write(string contents)
 		Fd ofd = openat_or_die(m_dir_fd, tmpname, FLAGS_CREATE_FILE, S_IRUSR | S_IWUSR);
 		BEESNOTE("writing " << tmpname << " in " << name_fd(m_dir_fd));
 		write_or_die(ofd, contents);
-#if 0
-		// This triggers too many btrfs bugs. I wish I was kidding.
-		// Forget snapshots, balance, compression, and dedupe:
-		// the system call you have to fear on btrfs is fsync().
-		// Also note that when bees renames a temporary over an
-		// existing file, it flushes the temporary, so we get
-		// the right behavior if we just do nothing here
-		// (except when the file is first created; however,
-		// in that case the result is the same as if the file
-		// did not exist, was empty, or was filled with garbage).
 		BEESNOTE("fsyncing " << tmpname << " in " << name_fd(m_dir_fd));
-		DIE_IF_NON_ZERO(fsync(ofd));
-#endif
+		bees_fsync(ofd);
 	}
 	BEESNOTE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
 	BEESTRACE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
@@ -444,6 +502,23 @@ BeesTempFile::resize(off_t offset)
 	// Count time spent here
 	BEESCOUNTADD(tmp_resize_ms, resize_timer.age() * 1000);
 
+	// Modify flags - every time
+	// - btrfs will keep trying to set FS_NOCOMP_FL behind us when compression heuristics identify
+	// the data as compressible, but it fails to compress
+	// - clear FS_NOCOW_FL because we can only dedupe between files with the same FS_NOCOW_FL state,
+	// and we don't open FS_NOCOW_FL files for dedupe.
+	BEESTRACE("Getting FS_COMPR_FL and FS_NOCOMP_FL on m_fd " << name_fd(m_fd));
+	int flags = ioctl_iflags_get(m_fd);
+	const auto orig_flags = flags;
+
+	flags |= FS_COMPR_FL;
+	flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
+	if (flags != orig_flags) {
+		BEESTRACE("Setting FS_COMPR_FL and clearing FS_NOCOMP_FL | FS_NOCOW_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
+		ioctl_iflags_set(m_fd, flags);
+	}
+
+	// That may have queued some delayed ref deletes, so throttle them
 	bees_throttle(resize_timer.age(), "tmpfile_resize");
 }
 
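The flag update added to `BeesTempFile::resize` reduces to two bit operations on the inode flag word, followed by a write only when something changed. The sketch below isolates that arithmetic using the constants from `<linux/fs.h>`. `normalize_tmpfile_flags` is a hypothetical name; in the diff the flags are read and written through `ioctl_iflags_get` and `ioctl_iflags_set`.

```cpp
#include <cassert>

#include <linux/fs.h>	// FS_COMPR_FL, FS_NOCOMP_FL, FS_NOCOW_FL

// Hypothetical helper: return the flag word the temporary file should have.
// Compression is forced on; the "no compression" and "no copy-on-write"
// bits are forced off. All other bits pass through unchanged.
static int
normalize_tmpfile_flags(int flags)
{
	flags |= FS_COMPR_FL;
	flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
	return flags;
}
```

The operation is idempotent, which is why comparing against the original flags (`flags != orig_flags` in the diff) is enough to skip the set call on every resize after the first, unless btrfs has changed the flags in the meantime.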
@@ -485,13 +560,6 @@ BeesTempFile::BeesTempFile(shared_ptr<BeesContext> ctx) :
 	// Add this file to open_root_ino lookup table
 	m_roots->insert_tmpfile(m_fd);
 
-	// Set compression attribute
-	BEESTRACE("Getting FS_COMPR_FL on m_fd " << name_fd(m_fd));
-	int flags = ioctl_iflags_get(m_fd);
-	flags |= FS_COMPR_FL;
-	BEESTRACE("Setting FS_COMPR_FL on m_fd " << name_fd(m_fd) << " flags " << to_hex(flags));
-	ioctl_iflags_set(m_fd, flags);
-
 	// Count time spent here
 	BEESCOUNTADD(tmp_create_ms, create_timer.age() * 1000);
 
@@ -683,7 +751,7 @@ bees_main(int argc, char *argv[])
 			BEESLOGDEBUG("exception (ignored): " << s);
 			BEESCOUNT(exception_caught_silent);
 		} else {
-			BEESLOGNOTICE("\n\n*** EXCEPTION ***\n\t" << s << "\n***\n");
+			BEESLOG(BEES_TRACE_LEVEL, "TRACE: EXCEPTION: " << s);
 			BEESCOUNT(exception_caught);
 		}
 	});
@@ -704,9 +772,8 @@ bees_main(int argc, char *argv[])
 	shared_ptr<BeesContext> bc = make_shared<BeesContext>();
 	BEESLOGDEBUG("context constructed");
 
-	string cwd(readlink_or_die("/proc/self/cwd"));
-
 	// Defaults
+	bool use_relative_paths = false;
 	bool chatter_prefix_timestamp = true;
 	double thread_factor = 0;
 	unsigned thread_count = 0;
@@ -778,7 +845,7 @@ bees_main(int argc, char *argv[])
 				thread_min = stoul(optarg);
 				break;
 			case 'P':
-				crucible::set_relative_path(cwd);
+				use_relative_paths = true;
 				break;
 			case 'T':
 				chatter_prefix_timestamp = false;
@@ -796,7 +863,7 @@ bees_main(int argc, char *argv[])
 				root_scan_mode = static_cast<BeesRoots::ScanMode>(stoul(optarg));
 				break;
 			case 'p':
-				crucible::set_relative_path("");
+				use_relative_paths = false;
 				break;
 			case 't':
 				chatter_prefix_timestamp = true;
@@ -866,18 +933,19 @@ bees_main(int argc, char *argv[])
 	BEESLOGNOTICE("setting root path to '" << root_path << "'");
 	bc->set_root_path(root_path);
 
+	// Set path prefix
+	if (use_relative_paths) {
+		crucible::set_relative_path(name_fd(bc->root_fd()));
+	}
+
 	// Workaround for btrfs send
 	bc->roots()->set_workaround_btrfs_send(workaround_btrfs_send);
 
 	// Set root scan mode
 	bc->roots()->set_scan_mode(root_scan_mode);
 
-	if (root_scan_mode == BeesRoots::SCAN_MODE_EXTENT) {
-		MultiLocker::enable_locking(false);
-	} else {
-		// Workaround for a kernel bug that the subvol-based crawlers keep triggering
-		MultiLocker::enable_locking(true);
-	}
+	// Workaround for the logical-ino-vs-clone kernel bug
+	MultiLocker::enable_locking(true);
 
 	// Start crawlers
 	bc->start();
28
src/bees.h
@@ -122,9 +122,9 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 // macros ----------------------------------------
 
 #define BEESLOG(lv,x) do { if (lv < bees_log_level) { Chatter __chatter(lv, BeesNote::get_name()); __chatter << x; } } while (0)
-#define BEESLOGTRACE(x) do { BEESLOG(LOG_DEBUG, x); BeesTracer::trace_now(); } while (0)
 
-#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(LOG_ERR, x << " at " << __FILE__ << ":" << __LINE__); })
+#define BEES_TRACE_LEVEL LOG_DEBUG
+#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(BEES_TRACE_LEVEL, "TRACE: " << x << " at " << __FILE__ << ":" << __LINE__); })
 #define BEESTOOLONG(x) BeesTooLong SRSLY_WTF_C(beesTooLong_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
 #define BEESNOTE(x) BeesNote SRSLY_WTF_C(beesNote_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
 
@@ -134,6 +134,14 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 #define BEESLOGINFO(x) BEESLOG(LOG_INFO, x)
 #define BEESLOGDEBUG(x) BEESLOG(LOG_DEBUG, x)
 
+#define BEESLOGONCE(__x) do { \
+	static bool already_logged = false; \
+	if (!already_logged) { \
+		already_logged = true; \
+		BEESLOGNOTICE(__x); \
+	} \
+} while (false)
+
 #define BEESCOUNT(stat) do { \
 	BeesStats::s_global.add_count(#stat); \
 } while (0)
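The `BEESLOGONCE` macro added above relies on a function-local `static` inside a `do { } while` block, so every place the macro is expanded gets its own independent flag. A minimal standalone sketch of the same pattern follows. `LOG_ONCE`, `g_log`, `report`, and `other_site` are hypothetical names, and the sketch appends to a vector where bees calls `BEESLOGNOTICE`.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical log sink so the behavior can be observed.
static std::vector<std::string> g_log;

// Each expansion declares its own static flag, so the message is emitted
// at most once per call site, not once per process.
#define LOG_ONCE(x) do { \
	static bool already_logged = false; \
	if (!already_logged) { \
		already_logged = true; \
		std::ostringstream log_once_oss; \
		log_once_oss << x; \
		g_log.push_back(log_once_oss.str()); \
	} \
} while (false)

static void
report(int n)
{
	LOG_ONCE("first call saw " << n);
}

static void
other_site()
{
	LOG_ONCE("other site");
}
```

Calling `report(1)`, `report(2)`, `report(3)` leaves a single entry, the one from the first call; a first call to `other_site()` then adds a second entry because it is a different expansion. As in the macro from the diff, the flag is a plain `bool` with no synchronization, so two threads racing through the same call site for the first time could both log.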
@@ -185,7 +193,7 @@ class BeesTracer {
 	thread_local static bool tl_silent;
 	thread_local static bool tl_first;
 public:
-	BeesTracer(function<void()> f, bool silent = false);
+	BeesTracer(const function<void()> &f, bool silent = false);
 	~BeesTracer();
 	static void trace_now();
 	static bool get_silent();
@@ -521,7 +529,7 @@ class BeesCrawl {
 
 	bool fetch_extents();
 	void fetch_extents_harder();
-	bool restart_crawl();
+	bool restart_crawl_unlocked();
 	BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;
 
 public:
@@ -535,6 +543,7 @@ public:
 	void deferred(bool def_setting);
 	bool deferred() const;
 	bool finished() const;
+	bool restart_crawl();
 };
 
 class BeesScanMode;
@@ -543,7 +552,8 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	shared_ptr<BeesContext> m_ctx;
 
 	BeesStringFile m_crawl_state_file;
-	map<uint64_t, shared_ptr<BeesCrawl>> m_root_crawl_map;
+	using CrawlMap = map<uint64_t, shared_ptr<BeesCrawl>>;
+	CrawlMap m_root_crawl_map;
 	mutex m_mutex;
 	uint64_t m_crawl_dirty = 0;
 	uint64_t m_crawl_clean = 0;
@@ -562,7 +572,7 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	condition_variable m_stop_condvar;
 	bool m_stop_requested = false;
 
-	void insert_new_crawl();
+	CrawlMap insert_new_crawl();
 	Fd open_root_nocache(uint64_t root);
 	Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
 	uint64_t transid_max_nocache();
@@ -578,13 +588,14 @@ class BeesRoots : public enable_shared_from_this<BeesRoots> {
 	void current_state_set(const BeesCrawlState &bcs);
 	bool crawl_batch(shared_ptr<BeesCrawl> crawl);
 	void clear_caches();
-
-	friend class BeesScanModeExtent;
 	shared_ptr<BeesCrawl> insert_root(const BeesCrawlState &bcs);
 	bool up_to_date(const BeesCrawlState &bcs);
 
 	friend class BeesCrawl;
 	friend class BeesFdCache;
+	friend class BeesScanMode;
+	friend class BeesScanModeSubvol;
+	friend class BeesScanModeExtent;
 
 public:
 	BeesRoots(shared_ptr<BeesContext> ctx);
@@ -890,5 +901,6 @@ void bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offse
 void bees_unreadahead(int fd, off_t offset, size_t size);
 void bees_throttle(double time_used, const char *context);
 string format_time(time_t t);
+bool exception_check();
 
 #endif
@@ -19,7 +19,9 @@ seeker_finder(const vector<uint64_t> &vec, uint64_t lower, uint64_t upper)
 	if (ub != s.end()) ++ub;
 	if (ub != s.end()) ++ub;
 	for (; ub != s.end(); ++ub) {
-		if (*ub > upper) break;
+		if (*ub > upper) {
+			break;
+		}
 	}
 	return set<uint64_t>(lb, ub);
 }
@@ -28,7 +30,7 @@ static bool test_fails = false;
 
 static
 void
-seeker_test(const vector<uint64_t> &vec, uint64_t const target)
+seeker_test(const vector<uint64_t> &vec, uint64_t const target, bool const always_out = false)
 {
 	cerr << "Find " << target << " in {";
 	for (auto i : vec) {
@@ -36,11 +38,13 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 	}
 	cerr << " } = ";
 	size_t loops = 0;
+	tl_seeker_debug_str = make_shared<ostringstream>();
+	bool local_test_fails = false;
 	bool excepted = catch_all([&]() {
-		auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
+		const auto found = seek_backward(target, [&](uint64_t lower, uint64_t upper) {
 			++loops;
 			return seeker_finder(vec, lower, upper);
-		});
+		}, uint64_t(32));
 		cerr << found;
 		uint64_t my_found = 0;
 		for (auto i : vec) {
@@ -52,13 +56,15 @@ seeker_test(const vector<uint64_t> &vec, uint64_t const target)
 			cerr << " (correct)";
 		} else {
 			cerr << " (INCORRECT - right answer is " << my_found << ")";
-			test_fails = true;
+			local_test_fails = true;
 		}
 	});
 	cerr << " (" << loops << " loops)" << endl;
-	if (excepted) {
-		test_fails = true;
+	if (excepted || local_test_fails || always_out) {
+		cerr << dynamic_pointer_cast<ostringstream>(tl_seeker_debug_str)->str();
 	}
+	test_fails = test_fails || local_test_fails;
+	tl_seeker_debug_str.reset();
 }
 
 static
@@ -89,6 +95,39 @@ test_seeker()
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max());
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() }, numeric_limits<uint64_t>::max() - 1);
 	seeker_test(vector<uint64_t> { 0, numeric_limits<uint64_t>::max() - 1 }, numeric_limits<uint64_t>::max());
+
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 0);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 1);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 2);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 3);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 4);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 5);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 6);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 7);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 8);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, 9);
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 1 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 2 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 3 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 4 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 5 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 6 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 7 );
+	seeker_test(vector<uint64_t> { 0, 1, 2, 4, 8 }, numeric_limits<uint64_t>::max() - 8 );
+
+	// Pulled from a bees debug log
+	seeker_test(vector<uint64_t> {
+		6821962845,
+		6821962848,
+		6821963411,
+		6821963422,
+		6821963536,
+		6821963539,
+		6821963835, // <- appeared during the search, causing an exception
+		6821963841,
+		6822575316,
+	}, 6821971036, true);
 }
 
 