systemd 258 has the following changes noted in systemd.resource-control(5):
> `CPUAccounting=` setting is deprecated, because it is always available on the unified cgroup hierarchy and such setting has no effect.
Closes #327
Signed-off-by: benaryorg <binary@benary.org>
If the extent wasn't read in the last second, chances are high that
it was evicted from the page cache. If the extents have been evicted
from the cache by the time we grow or dedupe them, we'll take a serious
performance hit as we read them back in, one page at a time.
Use a 5-second delay to match the default writeback interval.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Sometimes there are absurdly large readahead requests (e.g. 32G),
which tie up a thread holding the readahead lock for a long time (not
to mention the IO load that the reading imposes on the rest of the system).
These are likely an artifact of the legacy ExtentWalker code interacting
with concurrent filesystem changes.
The maximum btrfs extent size is 128M, so cap the length of readahead
requests at that size.
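A minimal sketch of the cap, assuming the request length is a plain
byte count. The helper name `cap_readahead_length` and the constant
`kMaxExtentSize` are hypothetical and are not taken from the bees source:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Maximum btrfs extent size stated above: 128 MiB.
constexpr uint64_t kMaxExtentSize = 128ULL * 1024 * 1024;

// Hypothetical helper: clamp a requested readahead length so that no
// single request covers more than one maximum-sized extent.
inline uint64_t cap_readahead_length(uint64_t requested)
{
	return std::min(requested, kMaxExtentSize);
}
```

A 32G request is cut down to 128M, while ordinary requests pass through
unchanged.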
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a cheap check for `FS_NOCOW_FL` when we first encounter
each extent. In the raw btrfs inode flags, the offending flag is
`BTRFS_INODE_NODATASUM`, because the restriction that prevents reflink
between datacow and "nodatacow" files is that a single inode is allowed
to have csums or not have csums, but must apply that choice to _all_
of its extents.
This extra check is cheaper than opening a file for each individual
reference to the extent, and then discovering that the file is
`FS_NOCOW_FL`, and then closing the file, over and over again. It will
also avoid emitting a lot of noisy log messages.
Fixes: https://github.com/Zygo/bees/issues/313
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With new kernel headers and old libc, `SYS_openat2` can still end up
undefined. That selects the build-time fallback code, which does not compile:
```
openat2.cc: In function 'int openat2(int, const char*, open_how*, size_t)':
openat2.cc:35:2: error: 'errno' was not declared in this scope
35 | errno = ENOSYS;
| ^~~~~
openat2.cc:24:1: note: 'errno' is defined in header '<cerrno>'; did you forget to '#include <cerrno>'?
23 | #include <unistd.h>
+++ |+#include <cerrno>
24 |
openat2.cc:35:10: error: 'ENOSYS' was not declared in this scope
35 | errno = ENOSYS;
| ^~~~~~
openat2.cc:29:19: error: unused parameter 'dirfd' [-Werror=unused-parameter]
29 | openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
| ~~~~~~~~~~^~~~~
openat2.cc:29:44: error: unused parameter 'pathname' [-Werror=unused-parameter]
29 | openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
| ~~~~~~~~~~~~~~~~~~^~~~~~~~
openat2.cc:29:77: error: unused parameter 'how' [-Werror=unused-parameter]
29 | t dirfd, const char *const pathname, struct open_how *const how, size_t const size)
| ~~~~~~~~~~~~~~~~~~~~~~~^~~
openat2.cc:29:95: error: unused parameter 'size' [-Werror=unused-parameter]
29 | st char *const pathname, struct open_how *const how, size_t const size)
| ~~~~~~~~~~~~~^~~~
```
Skip the kernel version check and test for the definition of `SYS_openat2`
directly. If it's not there, plug in the constant so we can send the
call directly to the kernel, bypassing libc completely.
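The approach can be sketched roughly as follows. This is an illustration,
not the code in `openat2.cc`: `sketch_open_how` and `sketch_openat2` are
hypothetical stand-ins so the example builds without `<linux/openat2.h>`,
and the constant 437 is the syscall number on most architectures (a few,
such as alpha, use a different number that this sketch does not handle).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Old libc headers may not define SYS_openat2 even though the running
// kernel supports it, so supply the number ourselves when it is missing.
#ifndef SYS_openat2
#define SYS_openat2 437
#endif

// Hypothetical stand-in with the same layout as the kernel's
// struct open_how (three 64-bit fields).
struct sketch_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

// Send the call straight to the kernel, bypassing any libc wrapper.
// Returns a file descriptor, or -1 with errno set (e.g. ENOSYS when
// the kernel has no openat2).
inline int sketch_openat2(int dirfd, const char *pathname,
                          sketch_open_how *how, size_t size)
{
	return static_cast<int>(syscall(SYS_openat2, dirfd, pathname, how, size));
}
```

Because the test is on `SYS_openat2` itself, the result no longer depends
on which kernel header version happens to be installed.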
Fixes: https://github.com/Zygo/bees/issues/318
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With the "idle" tag moved out of the `point` column, a `point` value of
1000000 may become visible--and push the table one column to the right.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This means the progress table in the status output reflects the state of
the oldest task in the queue, not the newest.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Adds a `BINDIR` Make variable, defaulting to `sbin`, allowing packagers
to override the install location of `beesd` for systems that do not use
`/sbin`. This affects the install path and systemd unit template.
Build fails on 32-bit Slackware because GCC 11's `-Werror=sign-compare`
is stricter than necessary:
cc -Wall -Wextra -Werror -O3 -I../include -D_FILE_OFFSET_BITS=64 -std=c99 -O2 -march=i586 -mtune=i686 -o bees-version.o -c bees-version.c
bees.cc: In function 'void bees_fsync(int)':
bees.cc:426:24: error: comparison of integer expressions of different signedness: '__fsword_t' {aka 'int'} and 'unsigned int' [-Werror=sign-compare]
426 | if (stf.f_type != BTRFS_SUPER_MAGIC) {
| ^
To work around this, cast `stf.f_type` to the same type as
`BTRFS_SUPER_MAGIC`, so it has the same number of bits that we're looking
for in the magic value.
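A sketch of the cast, assuming the magic value is an unsigned 32-bit
constant. `kBtrfsSuperMagic` mirrors the value of `BTRFS_SUPER_MAGIC`
and `is_btrfs_magic` is a hypothetical helper, both introduced only so
the example is self-contained:

```cpp
#include <cassert>
#include <cstdint>

// Same value as BTRFS_SUPER_MAGIC; redefined here for the sketch.
constexpr uint32_t kBtrfsSuperMagic = 0x9123683EU;

// Convert f_type to the magic constant's type before comparing, so
// both operands have the same width and signedness regardless of how
// the platform defines __fsword_t.
template <class FsWord>
bool is_btrfs_magic(FsWord f_type)
{
	return static_cast<uint32_t>(f_type) == kBtrfsSuperMagic;
}
```

The comparison gives the same answer whether `f_type` is a signed 32-bit
`int` or a 64-bit `long`.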
Fixes: https://github.com/Zygo/bees/issues/317
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A small performance optimization, given that we are constantly clobbering
the file with new content.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
btrfs will set the FS_NOCOMP_FL flag when all of the following are true:
1. The filesystem is not mounted with the `compress-force` option
2. Heuristic analysis of the data suggests the data is compressible
3. Compression fails to produce a result that is smaller than the original
If the compression ratio is 40%, and the original data is 128K long,
then compressed data will be about 52K long (rounded up to 4K), so item
3 is usually false; however, if the original data is 8K long, then the
compressed data will be 8K long too, and btrfs will set FS_NOCOMP_FL.
To work around that, keep setting FS_COMPR_FL and clearing FS_NOCOMP_FL
every time a TempFile is reset.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
FS_NOCOW_FL can be inherited from the subvol root directory, and it
conflicts with FS_COMPR_FL.
We can only dedupe when FS_NOCOW_FL is the same on src and dst, which
means we can only dedupe when FS_NOCOW_FL is clear, so we should clear
FS_NOCOW_FL on the temporary files we create for dedupe.
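The flag handling for temporary files can be sketched as a pure function
on the inode flag word. `tempfile_inode_flags` is a hypothetical name;
the sketch assumes the caller reads and writes the flags itself (for
example with `FS_IOC_GETFLAGS` and `FS_IOC_SETFLAGS`):

```cpp
#include <cassert>
#include <linux/fs.h>

// Hypothetical helper: given a temporary file's current inode flags,
// return the flags it should have before it is used for dedupe.
inline int tempfile_inode_flags(int flags)
{
	// Ask for compression every time, since btrfs may have turned
	// it off after an earlier write failed to shrink.
	flags |= FS_COMPR_FL;
	// Clear the "don't compress" marker and any inherited nodatacow.
	flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);
	return flags;
}
```

Unrelated flags are left alone; only the three flags discussed here are
touched.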
Fixes: https://github.com/Zygo/bees/issues/314
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apply the "idle" label only when the crawl is finished _and_ its
transid_max is up to date. This makes the keyword "idle" better reflect
when bees is not only finished crawling, but also scanning the crawled
extents in the queue.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When all extents within a size tier have been queued, and all the
extents belong to the same file, the queue might take a long time to
fully process. Also, any progress that is made will be obscured by
the "idle" tag in the "point" column.
Move "idle" to the next cycle ETA column, since the ETA duration will
be zero, and no useful information is lost since we would have "-"
there anyway.
Since the "point" column can now display the maximum value, lower
that maximum to 999999 so that we don't use an extra column.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With subvol scan, the crawl task name is the subvol/inode pair
corresponding to the file offset in the log message. The identity of
the file can be determined by looking up the subvol/inode pair in the
log message.
With extent scan, the crawl task name is the extent bytenr corresponding
to the file offset in the log message. This extent is deleted when the
log message is emitted, so a later lookup on the extent bytenr will not
find any references to the extent, and the identity of the file cannot
be determined.
Log the bfr, which does a /proc lookup on the name of the fd, so the
filename is logged.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
During the search, the region between `upper_bound` and `target_pos`
should contain no data items. The search lowers `upper_bound` and raises
`lower_bound` until they both point to the last item before `target_pos`.
The `lower_bound` is increased to the position of the last item returned
by a search (`high_pos`) when that item is lower than `target_pos`.
This avoids some loop iterations compared to a strict binary search
algorithm, which would increase `lower_bound` only as far as `probe_pos`.
When the search runs over live extent items, occasionally a new extent
will appear between `upper_bound` and `target_pos`. When this happens,
`lower_bound` is bumped up to the position of one of the new items, but
that position is in the "unoccupied" space between `upper_bound` and
`target_pos`, where no items are supposed to exist, so `seek_backward`
throws an exception.
To cut down on the noise, only increase `lower_bound` as far as
`upper_bound`. This avoids the exception without increasing the number
of loop iterations for normal cases.
In the exceptional cases, extra loop iterations are needed to skip over
the new items. This raises the worst-case number of loop iterations
by one.
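The clamping step alone might look like the sketch below. It covers only
the update of `lower_bound`, not the surrounding search loop, and the
function `next_lower_bound` is a hypothetical name for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: compute the new lower_bound after a search
// returned its last item at high_pos.
inline uint64_t next_lower_bound(uint64_t lower_bound, uint64_t upper_bound,
                                 uint64_t high_pos, uint64_t target_pos)
{
	if (high_pos < target_pos) {
		// Jump ahead to the last item seen, but never past
		// upper_bound, so a newly appeared item in the
		// "unoccupied" region cannot drag lower_bound into it.
		return std::max(lower_bound, std::min(high_pos, upper_bound));
	}
	return lower_bound;
}
```

In the normal case `high_pos` is already at or below `upper_bound`, so
the clamp changes nothing.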
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Send both tree_search ioctl and `seek_backward` debug logs to the
same output stream, but only write that stream to the debug log if
there is an exception.
The feature remains disabled at compile time.
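One way to get "log only on exception" behaviour is to buffer the debug
output and copy it to the real log in a catch block. The sketch below
assumes that approach; `run_with_deferred_log` is a hypothetical helper
and not the interface used in the library:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: run work(buffer) with a private debug stream.
// The buffered text reaches debug_log only if work throws; otherwise
// it is discarded when the buffer goes out of scope.
template <class Work>
void run_with_deferred_log(std::ostream &debug_log, Work work)
{
	std::ostringstream buffer;
	try {
		work(buffer);
	} catch (...) {
		debug_log << buffer.str();
		throw;  // let the caller handle the original exception
	}
}
```

The cost of formatting the messages is still paid on every call; only
the writes to the debug log are avoided.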
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Note that when enabled, the logs are still very CPU-intensive,
but most of the logs will be discarded.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit d32f31f411 ("btrfs-tree: harden
`rlower_bound` against exceptional objects") passes the first btrfs item
in the result set that is above upper_bound up to `seek_backward`.
This is somewhat wasteful as `seek_backward` cannot use such a result.
Reverse that change in behavior, while keeping the rest of the other
commit.
This introduces a new case, where the search ioctl is producing items
that are above upper bound, but there are no items in the result set,
which continues looping until the end of the filesystem is reached.
Handle that by setting an explicit exit variable.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This allows detailed but selective debugging when using the library,
particularly when something goes wrong.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The comment describes an earlier version which submitted each extent
ref as a separate Task, but now all extent refs are handled by the same
Task to minimize the amount of time between processing the first and
last reference to an extent.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Previously `scan()` would run the extent scan loop once, and enqueue one
extent, before checking for throttling. Do an extra check before that,
and bail out so that zero extents are enqueued when throttled.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Linux kernel thread names are limited to 16 bytes, including the
terminating NUL. Every character counts, and "0x" wastes two.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The "tm_left" field was the estimated _total_ duration of the crawl,
not the amount of time remaining. The ETA timestamp was then calculated
based on the estimated time to run the crawl if it started _now_, not
at the start timestamp.
Fix the duration and ETA calculations.
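The corrected arithmetic can be illustrated with a small sketch. It
assumes progress is known as a fraction of the crawl and extrapolates
linearly; the struct and function names are hypothetical, and the real
code may estimate the total duration differently:

```cpp
#include <cassert>

struct CrawlEstimate {
	double tm_left;  // seconds remaining, not total duration
	double eta;      // absolute timestamp of expected completion
};

// Hypothetical helper: start and now are timestamps in seconds,
// progress is the completed fraction in (0, 1].
inline CrawlEstimate estimate_crawl(double start, double now, double progress)
{
	assert(progress > 0.0 && progress <= 1.0);
	const double elapsed = now - start;
	const double total = elapsed / progress;  // estimated total duration
	const double left = total - elapsed;      // what is actually remaining
	// ETA is measured from the start timestamp: start + total,
	// which is the same instant as now + left.
	return CrawlEstimate{ left, now + left };
}
```

For a crawl that started at t=1000, is 25% done at t=1300, the total is
1200 seconds, 900 seconds remain, and the ETA is t=2200.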
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The status file contains sensitive information like filenames and duplicate chunk ranges. It might also make sense to set the process-wide `UMask=`, but that may have other unintended side effects.
The `_nothrow` variants of `do_ioctl` return true when they succeed,
which is the opposite of what `ioctl` does.
Fix the logic so bees can correctly identify its own hash table when
it's on the same filesystem as the target.
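The difference in conventions is easy to get backwards. In the sketch
below, `try_ioctl` is a hypothetical wrapper written in the same style
as the `_nothrow` variants (true means success); it is not the library
function itself:

```cpp
#include <cassert>
#include <sys/ioctl.h>
#include <unistd.h>

// Hypothetical wrapper: returns true on success, false on failure.
// Raw ioctl() is the other way around: it returns -1 on failure and
// a non-negative value (usually 0) on success.
inline bool try_ioctl(int fd, unsigned long request, void *arg)
{
	return ioctl(fd, request, arg) >= 0;
}
```

Code written for raw `ioctl` that treats a zero result as success will
take the wrong branch if it is pointed at a wrapper like this without
inverting the test.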
Fixes: f6908420ad ("hash: handle $BEESHOME on non-btrfs")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit 183b6a5361 ("extent scan: refactor
BeesCrawl, BeesScanMode*") moved some statistics calculations out of
the loop in `find_next_extent`, but did not ensure that the statistics
would not be calculated if the loop had not executed any iterations.
In rare instances, the function returns without entering the loop at all,
which results in divide by zero. Add a check just before doing that.
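The guard amounts to something like the following sketch. The function
and parameter names are hypothetical; the actual statistics computed in
`find_next_extent` are not shown here:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: average a counter over the number of loop
// iterations, returning 0 when the loop never ran instead of
// dividing by zero.
inline uint64_t average_per_iteration(uint64_t total, uint64_t iterations)
{
	if (iterations == 0) {
		return 0;
	}
	return total / iterations;
}
```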
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Exceptions were logged at level NOTICE while the stack traces were logged
at level DEBUG. That produced useless noise in the output with `-v5`
or `-v6`, where there were exception headings logged, but no details.
Fix that by placing the exceptions and traces at level DEBUG, but prefix
them with `TRACE:` for easy grepping.
Most of the events associated with BEESLOGTRACE either never happen,
or they are harmless (e.g. trying to open deleted files or subvols).
Reassign them to ordinary BEESLOGDEBUG, with one exception for
unrecognized Extent flags that should be debugged if any appear.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
While investigating https://github.com/Zygo/bees/issues/282 I noticed that
we're doing at least one unnecessary extra copy of the functor in BEESTRACE.
Get rid of it with a const reference.
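The effect of the change can be demonstrated with a functor that counts
its own copies. This is a standalone illustration; `CountingFunctor`,
`call_by_value` and `call_by_const_ref` are hypothetical names and are
not part of BEESTRACE:

```cpp
#include <cassert>

// Functor that records how many times it has been copied.
struct CountingFunctor {
	static int copies;
	CountingFunctor() = default;
	CountingFunctor(const CountingFunctor &) { ++copies; }
	void operator()() const {}
};
int CountingFunctor::copies = 0;

// Taking the functor by value copies it on every call...
template <class F> void call_by_value(F f) { f(); }

// ...while taking it by const reference does not.
template <class F> void call_by_const_ref(const F &f) { f(); }
```

For a small lambda the copy is cheap, but it is paid on every traced
scope, so avoiding it is essentially free.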
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Older kernel versions featured some bugs in btrfs `fsync`, which could
leave behind "ghost dirents", orphan filename items that did not have
a corresponding inode. These dirents were created during log replay
during the first mount after a crash due to several different bugs in
the log tree and its use over the years. The last known bug of this
kind was fixed in kernel 5.16. As of this writing, no fixes for this
bug have been backported to any earlier LTS kernel.
Some filesystems, including btrfs, will flush the contents of a new
file before renaming it over an old file. On paper, btrfs can do this
very cheaply since the contents of the new file are not referenced, and
the old file not dereferenced, until a tree commit which includes both
actions atomically; however, in real life, btrfs provides `fsync`-like
semantics and uses the log-tree infrastructure to implement them, which
compromises performance and acts as a magnet for bugs.
The benefit of this trade-off is that `rename` can be used as a
synchronization point for data outside of the btrfs, which would not
happen if everything `rename` does was simply deferred to the next
tree commit. The cost of this trade-off is that for the first 8 years
of its existence, bees would trigger the bug so often that the project
recommended its users put $BEESHOME in its own subvol to make it easy
to remove ghost dirents left behind by the bug.
Some other filesystems, such as xfs, don't have any special semantics for
`rename`, and require `fsync` to avoid garbage or missing data after
a crash. Even filesystems which do have a special case for `rename`
can be configured to turn it off.
btrfs will silently delete data from files in the event that an
unrecoverable data block write error occurs. Kernel version 6.2 adds
important new and unexpected cases where this can happen on filesystems
using raid56 data, but it also happens in all usable btrfs versions
(the silent deletion behavior was introduced in kernel version 3.9).
Unrecoverable write errors are currently reported to userspace only
through `fsync`. Since the failed extents are deleted, they cannot be
detected via csum failures or scrub after the fact--and by then it's
too late, because the data is already gone. `fsync` is the last opportunity
to detect the write failure before the `rename`. If the error is not
detected, the contents of the file will be silently discarded in btrfs.
The impact on bees is that scans will abruptly restart from zero after
a crash combined with some other reasonably common failures.
Putting all of this together leads to a rather complex workaround:
if the filesystem under $BEESHOME (specifically, the filesystem where
BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs
filesystem, and the host kernel is a version prior to 5.16, then don't
call `fsync` before `rename`. In all other cases, do call `fsync`,
and prevent dependent writes (i.e. the following `rename`) in the event
of errors.
Since present kernel versions still require `fsync`, we don't need
an upper bound on the kernel version check until someone fixes btrfs
`rename` (or perhaps adds a flag to `renameat2` which prevents use of
the log tree) in the kernel. Once that fix happens, we can drop the
`fsync` call for kernels after that fixed version.
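The decision described above can be sketched as a pure function. The
names are hypothetical, and the sketch assumes the caller has already
determined whether $BEESHOME is on btrfs (e.g. with `statfs`) and has a
kernel release string such as the one `uname` reports:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: extract "major.minor" from a kernel release
// string like "5.15.0-91-generic". Returns false if it cannot parse.
inline bool parse_kernel_version(const char *release, int &major, int &minor)
{
	return std::sscanf(release, "%d.%d", &major, &minor) == 2;
}

// Hypothetical helper: skip fsync only for btrfs on kernels older
// than 5.16, where fsync can leave ghost dirents behind. In every
// other case, fsync before rename so write errors are detected.
inline bool should_fsync_before_rename(bool home_is_btrfs, int major, int minor)
{
	const bool kernel_before_5_16 = major < 5 || (major == 5 && minor < 16);
	return !(home_is_btrfs && kernel_before_5_16);
}
```

There is deliberately no upper bound on the version check, matching the
reasoning above.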
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Another hit from the exotic compiler collection: build fails on GCC 9,
from Ubuntu 20...but not later versions of GCC.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Fix the missing symbols that popped up when adding chunk tree to
lib/fs.cc. Also define the missing symbols instead of merely trying to
avoid them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This obviously doesn't fix or prevent the kernel bug, but it does prevent
bees from triggering the bug without assistance from another application.
The bug can still be triggered by running bees at the same time as an
application which uses clone or LOGICAL_INO. `btdu` uses LOGICAL_INO,
while `cp` from coreutils (and many other tools) uses clone (reflink copy).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In commit 31b2aa3c0d ("context: speed
up orderly process termination"), the stop request was split into two
methods after the mutex unlock.
Now that there's nothing after the mutex unlock in `stop_request`,
there's no need for an explicit unlock to do what the destructor would
have done anyway.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Parallel scan runs each extent size tier in a separate thread. The
threads compete to process extents within the tier's size range.
Ordered scan processes each extent size tier completely before moving on
to the next. In theory, this means large extents always get processed
quickly, especially when new ones appear, and the queue does not fill up
with small extents.
In practice, the multi-threaded scanner massively outperforms the
single-threaded scanner, unless the number of worker threads is very
small (i.e. one).
Disable most of the feature for now, but leave the code in place so it
can be easily reactivated for future testing.
Ordered scan introduces a parallelized extent mapper Task. Keep that in
parallel scan mode, which further enhances the parallelism. The extent
scan crawl threads now run at 'idle' priority while the map tasks run
at normal priority, so the map tasks don't flood the task queue.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesScanModeExtent can do that by itself now. Overloading the subvol
crawl code resulted in an ugly, inefficient hack, and we definitely
don't want to accidentally continue to use it.
Remove the support for reading the extent tree and add some `assert`s
to make sure it isn't still used somewhere.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The main gains here are:
* Move extent tree searches into BeesScanModeExtent so that they are
not slowed down by the BeesCrawl code, which was designed for the
much more specialized metadata in subvol trees.
* Enable short extent skipping now that BeesCrawl is out of the way.
* Stop enumerating btrfs subvols when in extent scan mode.
All this gets rid of >99% of unnecessary extent tree searches.
Incremental extent scan cycles now finish in milliseconds instead
of minutes.
BeesCrawl was never designed to cope with the structure and content of
the extent tree. It would waste thousands of tree-search ioctl calls
reading and ignoring metadata items.
Performance was particularly bad when a binary search was involved, as any
binary search probe that landed in a metadata block group would read and
discard all the metadata items in the block group, sequentially, repeated
for each level of the binary search. This was blocking implementation of
short extent skipping optimization for large extent size tiers, because
the skips were using thousands of tree searches to skip over only a few
hundred extent items.
Extent scan also had to read every extent item twice to do the
transid filtering, because BeesCrawl's interface discarded the relevant
information when it converted a `BtrfsTreeItem` into a `BeesFileRange`.
The cost of this extra fetch was negligible, but it could have been zero.
Fix this by:
* Copy the equivalent of `fetch_extents` from BeesCrawl into
`BeesScanModeExtent`, then give each of the extent scan crawlers its
own `BtrfsDataExtentTreeFetcher` instance. This enables extent tree
searches to avoid pure (non-mixed) metadata block groups. `BeesCrawl`
is now used only for its interface to `BeesRoots` for saving state in
`beescrawl.dat`, and never to determine the next extent tree item.
* Move subvol-specific parts of `BeesRoots` into a new class
`BeesScanModeSubvol` so that `BeesScanModeExtent` doesn't have to enable
or support them. In particular, `bees -m4` no longer enumerates all
of the _subvol_ crawlers. `BeesRoots` is still used to save and load
crawl state.
* Move several members from `BeesScanModeExtent` into a per-crawler
state object `SizeTier` to eliminate the need for some locks and to
maintain separate cache state for `BtrfsDataExtentTreeFetcher`.
* Reuse the `BtrfsTreeItem` to get the generation field for the transid
range filter.
* Avoid a few corner cases when handling errors, where extent scan might
drop an extent without scanning it, or fail to advance to the next extent.
* Enable the extent-skipping algorithm for large size tiers, now that
`BeesCrawl::fetch_extents` is no longer slowing it down.
* Add a debug stream interface which developers can easily turn on when
needed to inspect the decisions that extent scan is making.
* Track metrics that are more useful, particularly searches per extent
scanned, and fraction of extents that are skipped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This gets rid of one open-coded btrfs tree search.
Also reduce the log noise level for subvol open failures, and remove
some ancient references to `BEESLOG`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>