Adds a `BINDIR` Make variable, defaulting to `sbin`, allowing packagers
to override the install location of `beesd` for systems that do not use
`/sbin`. This affects the install path and systemd unit template.
Build fails on 32-bit Slackware because GCC 11's `-Werror=sign-compare`
is stricter than necessary:
cc -Wall -Wextra -Werror -O3 -I../include -D_FILE_OFFSET_BITS=64 -std=c99 -O2 -march=i586 -mtune=i686 -o bees-version.o -c bees-version.c
bees.cc: In function 'void bees_fsync(int)':
bees.cc:426:24: error: comparison of integer expressions of different signedness: '__fsword_t' {aka 'int'} and 'unsigned int' [-Werror=sign-compare]
426 | if (stf.f_type != BTRFS_SUPER_MAGIC) {
| ^
To work around this, cast `stf.f_type` to the same type as
`BTRFS_SUPER_MAGIC`, so that both sides of the comparison have the
same width and signedness when matching the magic value.
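A minimal sketch of the workaround, with a hypothetical helper name
(the code in bees.cc is shaped differently):

    #include <sys/statfs.h>
    #include <linux/magic.h>        // BTRFS_SUPER_MAGIC

    // Cast the possibly-signed __fsword_t to the unsigned type of the
    // magic constant, so -Werror=sign-compare is satisfied on 32-bit
    // targets where __fsword_t is a signed int.
    bool fd_is_btrfs(int fd) {
        struct statfs stf = {};
        if (fstatfs(fd, &stf)) return false;
        return static_cast<decltype(BTRFS_SUPER_MAGIC)>(stf.f_type)
            == BTRFS_SUPER_MAGIC;
    }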
Fixes: https://github.com/Zygo/bees/issues/317
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A small performance optimization, given that we are constantly clobbering
the file with new content.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
btrfs will set the FS_NOCOMP_FL flag when all of the following are true:
1. The filesystem is not mounted with the `compress-force` option
2. Heuristic analysis of the data suggests the data is compressible
3. Compression fails to produce a result that is smaller than the original
If the compression ratio is 40%, and the original data is 128K long,
then compressed data will be about 52K long (rounded up to 4K), so item
3 is usually false; however, if the original data is 8K long, then the
compressed data will be 8K long too, and btrfs will set FS_NOCOMP_FL.
To work around that, keep setting FS_COMPR_FL and clearing FS_NOCOMP_FL
every time a TempFile is reset.
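A sketch of the reset-time fixup, assuming `fd` is the TempFile's
descriptor (the actual TempFile code may wrap these ioctls differently):

    #include <sys/ioctl.h>
    #include <linux/fs.h>   // FS_IOC_GETFLAGS, FS_COMPR_FL, FS_NOCOMP_FL

    // Reassert the compression hint on every TempFile reset, undoing
    // any FS_NOCOMP_FL that btrfs set after a failed compression
    // attempt on short data.
    void force_compress_flag(int fd) {
        int attr = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attr)) return;  // best effort
        attr |= FS_COMPR_FL;
        attr &= ~FS_NOCOMP_FL;
        ioctl(fd, FS_IOC_SETFLAGS, &attr);
    }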
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
FS_NOCOW_FL can be inherited from the subvol root directory, and it
conflicts with FS_COMPR_FL.
We can only dedupe when FS_NOCOW_FL is the same on src and dst, which
means we can only dedupe when FS_NOCOW_FL is clear, so we should clear
FS_NOCOW_FL on the temporary files we create for dedupe.
Fixes: https://github.com/Zygo/bees/issues/314
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apply the "idle" label only when the crawl is finished _and_ its
transid_max is up to date. This makes the keyword "idle" better reflect
when bees is not only finished crawling, but also scanning the crawled
extents in the queue.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When all extents within a size tier have been queued, and all the
extents belong to the same file, the queue might take a long time to
fully process. Also, any progress that is made will be obscured by
the "idle" tag in the "point" column.
Move "idle" to the next cycle ETA column, since the ETA duration will
be zero, and no useful information is lost since we would have "-"
there anyway.
Since the "point" column can now display the maximum value, lower
that maximum to 999999 so that we don't use an extra column.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With subvol scan, the crawl task name is the subvol/inode pair
corresponding to the file offset in the log message. The identity of
the file can be determined by looking up the subvol/inode pair in the
log message.
With extent scan, the crawl task name is the extent bytenr corresponding
to the file offset in the log message. This extent is deleted when the
log message is emitted, so a later lookup on the extent bytenr will not
find any references to the extent, and the identity of the file cannot
be determined.
Log the bfr, which does a /proc lookup on the name of the fd, so the
filename is logged.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
During the search, the region between `upper_bound` and `target_pos`
should contain no data items. The search lowers `upper_bound` and raises
`lower_bound` until they both point to the last item before `target_pos`.
The `lower_bound` is increased to the position of the last item returned
by a search (`high_pos`) when that item is lower than `target_pos`.
This avoids some loop iterations compared to a strict binary search
algorithm, which would increase `lower_bound` only as far as `probe_pos`.
When the search runs over live extent items, occasionally a new extent
will appear between `upper_bound` and `target_pos`. When this happens,
`lower_bound` is bumped up to the position of one of the new items, but
that position is in the "unoccupied" space between `upper_bound` and
`target_pos`, where no items are supposed to exist, so `seek_backward`
throws an exception.
To cut down on the noise, only increase `lower_bound` as far as
`upper_bound`. This avoids the exception without increasing the number
of loop iterations for normal cases.
In the exceptional cases, extra loop iterations are needed to skip over
the new items. This raises the worst-case number of loop iterations
by one.
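The change, sketched with the variable names used above:

    #include <algorithm>
    #include <cstdint>

    // Advance lower_bound toward the last item returned by the search
    // (high_pos), but never past upper_bound, so a freshly created
    // extent in the "unoccupied" region cannot trigger the exception.
    uint64_t advance_lower_bound(uint64_t lower_bound, uint64_t upper_bound,
                                 uint64_t high_pos) {
        return std::max(lower_bound, std::min(high_pos, upper_bound));
    }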
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Send both tree_search ioctl and `seek_backward` debug logs to the
same output stream, but only write that stream to the debug log if
there is an exception.
The feature remains disabled at compile time.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Note that when enabled, the logs are still very CPU-intensive,
but most of the logs will be discarded.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit d32f31f411 ("btrfs-tree: harden
`rlower_bound` against exceptional objects") passes the first btrfs item
in the result set that is above upper_bound up to `seek_backward`.
This is somewhat wasteful as `seek_backward` cannot use such a result.
Reverse that change in behavior, while keeping the rest of that
commit.
This introduces a new case: the search ioctl produces only items that
are above the upper bound, so no items land in the result set, and the
search would keep looping until the end of the filesystem is reached.
Handle that by setting an explicit exit variable.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This allows detailed but selective debugging when using the library,
particularly when something goes wrong.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The comment describes an earlier version which submitted each extent
ref as a separate Task, but now all extent refs are handled by the same
Task to minimize the amount of time between processing the first and
last reference to an extent.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Previously `scan()` would run the extent scan loop once, and enqueue one
extent, before checking for throttling. Do an extra check before that,
and bail out so that zero extents are enqueued when throttled.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Linux kernel thread names are hardcoded at 16 characters. Every character
counts, and "0x" wastes two.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The "tm_left" field was the estimated _total_ duration of the crawl,
not the amount of time remaining. The ETA timestamp was then calculated
based on the estimated time to run the crawl if it started _now_, not
at the start timestamp.
Fix the duration and ETA calculations.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The status file contains sensitive information like filenames and
duplicate chunk ranges. It might also make sense to set the process-wide
`UMask=`, but that may have other unintended side effects.
The `_nothrow` variants of `do_ioctl` return true when they succeed,
which is the opposite of what `ioctl` does.
Fix the logic so bees can correctly identify its own hash table when
it's on the same filesystem as the target.
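For illustration, the convention looks roughly like this (hypothetical
signature):

    #include <sys/ioctl.h>

    // ioctl(2) returns 0 on success and -1 on failure; the _nothrow
    // wrappers return true on success. A bare `if (do_ioctl_nothrow(...))`
    // therefore has the opposite sense of `if (ioctl(...))`.
    bool do_ioctl_nothrow(int fd, unsigned long request, void *arg) {
        return ioctl(fd, request, arg) == 0;
    }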
Fixes: f6908420ad ("hash: handle $BEESHOME on non-btrfs")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit 183b6a5361 ("extent scan: refactor
BeesCrawl, BeesScanMode*") moved some statistics calculations out of
the loop in `find_next_extent`, but did not ensure that the statistics
would not be calculated if the loop had not executed any iterations.
In rare instances, the function returns without entering the loop at
all, which results in a divide by zero. Add a check so the statistics
are skipped entirely when the loop has executed zero iterations.
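The shape of the guard, sketched with hypothetical names:

    #include <cstdint>

    // Skip the statistics when the loop scanned nothing, so the
    // division below cannot divide by zero.
    bool update_loop_stats(uint64_t total_size, uint64_t loop_count) {
        if (loop_count == 0) return false;      // loop never ran
        const uint64_t avg_size = total_size / loop_count;
        (void)avg_size;  // recorded into counters in the real code
        return true;
    }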
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Exceptions were logged at level NOTICE while the stack traces were logged
at level DEBUG. That produced useless noise in the output with `-v5`
or `-v6`, where there were exception headings logged, but no details.
Fix that by placing the exceptions and traces at level DEBUG, but prefix
them with `TRACE:` for easy grepping.
Most of the events associated with BEESLOGTRACE either never happen,
or they are harmless (e.g. trying to open deleted files or subvols).
Reassign them to ordinary BEESLOGDEBUG, with one exception for
unrecognized Extent flags that should be debugged if any appear.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
While investigating https://github.com/Zygo/bees/issues/282 I noticed that
we're doing at least one unnecessary extra copy of the functor in BEESTRACE.
Get rid of it with a const reference.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Older kernel versions featured some bugs in btrfs `fsync`, which could
leave behind "ghost dirents", orphan filename items that did not have
a corresponding inode. These dirents were created during log replay
during the first mount after a crash due to several different bugs in
the log tree and its use over the years. The last known bug of this
kind was fixed in kernel 5.16. As of this writing, no fixes for this
bug have been backported to any earlier LTS kernel.
Some filesystems, including btrfs, will flush the contents of a new
file before renaming it over an old file. On paper, btrfs can do this
very cheaply since the contents of the new file are not referenced, and
the old file not dereferenced, until a tree commit which includes both
actions atomically; however, in real life, btrfs provides `fsync`-like
semantics and uses the log-tree infrastructure to implement them, which
compromises performance and acts as a magnet for bugs.
The benefit of this trade-off is that `rename` can be used as a
synchronization point for data outside of the btrfs, which would not
happen if everything `rename` does were simply deferred to the next
tree commit. The cost of this trade-off is that for the first 8 years
of its existence, bees would trigger the bug so often that the project
recommended its users put $BEESHOME in its own subvol to make it easy
to remove ghost dirents left behind by the bug.
Some other filesystems, such as xfs, don't have any special semantics for
`rename`, and require `fsync` to avoid garbage or missing data after
a crash. Even filesystems which do have a special case for `rename`
can be configured to turn it off.
btrfs will silently delete data from files in the event that an
unrecoverable data block write error occurs. Kernel version 6.2 adds
important new and unexpected cases where this can happen on filesystems
using raid56 data, but it also happens in all usable btrfs versions
(the silent deletion behavior was introduced in kernel version 3.9).
Unrecoverable write errors are currently reported to userspace only
through `fsync`. Since the failed extents are deleted, they cannot be
detected via csum failures or scrub after the fact--and it's too late
by then, the data is already gone. `fsync` is the last opportunity
to detect the write failure before the `rename`. If the error is not
detected, the contents of the file will be silently discarded in btrfs.
The impact on bees is that scans will abruptly restart from zero after
a crash combined with some other reasonably common failures.
Putting all of this together leads to a rather complex workaround:
if the filesystem under $BEESHOME (specifically, the filesystem where
BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs
filesystem, and the host kernel is a version prior to 5.16, then don't
call `fsync` before `rename`. In all other cases, do call `fsync`,
and prevent dependent writes (i.e. the following `rename`) in the event
of errors.
Since present kernel versions still require `fsync`, we don't need
an upper bound on the kernel version check until someone fixes btrfs
`rename` (or perhaps adds a flag to `renameat2` which prevents use of
the log tree) in the kernel. Once that fix happens, we can drop the
`fsync` call for kernels after that fixed version.
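A condensed sketch of the resulting write path. The caller computes
`skip_fsync` from the two conditions above; the real BeesStringFile
code differs in structure and error handling:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <stdexcept>
    #include <string>

    // skip_fsync == "$BEESHOME is on btrfs AND the running kernel is
    // older than 5.16" (both checks live elsewhere).
    void commit_file(const std::string &tmp, const std::string &final_path,
                     bool skip_fsync) {
        if (!skip_fsync) {
            int fd = open(tmp.c_str(), O_RDONLY);
            if (fd < 0) throw std::runtime_error("open: " + tmp);
            int rv = fsync(fd);
            close(fd);
            // A failed fsync means btrfs may have deleted the extents,
            // so suppress the dependent rename.
            if (rv) throw std::runtime_error("fsync: " + tmp);
        }
        if (rename(tmp.c_str(), final_path.c_str()))
            throw std::runtime_error("rename: " + tmp);
    }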
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Another hit from the exotic compiler collection: build fails on GCC 9,
from Ubuntu 20...but not later versions of GCC.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Fix the missing symbols that popped up when adding chunk tree to
lib/fs.cc. Also define the missing symbols instead of merely trying to
avoid them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This obviously doesn't fix or prevent the kernel bug, but it does prevent
bees from triggering the bug without assistance from another application.
The bug can still be triggered by running bees at the same time as an
application which uses clone or LOGICAL_INO. `btdu` uses LOGICAL_INO,
while `cp` from coreutils (and many others) use clone (reflink copy).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In commit 31b2aa3c0d ("context: speed
up orderly process termination"), the stop request was split into two
methods after the mutex unlock.
Now that there's nothing after the mutex unlock in `stop_request`,
there's no need for an explicit unlock to do what the destructor would
have done anyway.
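The resulting shape, sketched with hypothetical member names:

    #include <condition_variable>
    #include <mutex>

    struct StopSketch {
        std::mutex m_mutex;
        std::condition_variable m_cond;
        bool m_stop_requested = false;

        void stop_request() {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_stop_requested = true;
            m_cond.notify_all();
            // No explicit lock.unlock(): ~unique_lock releases the
            // mutex, and nothing else remains to be done here.
        }
    };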
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Parallel scan runs each extent size tier in a separate thread. The
threads compete to process extents within the tier's size range.
Ordered scan processes each extent size tier completely before moving on
to the next. In theory, this means large extents always get processed
quickly, especially when new ones appear, and the queue does not fill up
with small extents.
In practice, the multi-threaded scanner massively outperforms the
single-threaded scanner, unless the number of worker threads is very
small (i.e. one).
Disable most of the feature for now, but leave the code in place so it
can be easily reactivated for future testing.
Ordered scan introduces a parallelized extent mapper Task. Keep that in
parallel scan mode, which further enhances the parallelism. The extent
scan crawl threads now run at 'idle' priority while the map tasks run
at normal priority, so the map tasks don't flood the task queue.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesScanModeExtent can do that by itself now. Overloading the subvol
crawl code resulted in an ugly, inefficient hack, and we definitely
don't want to accidentally continue to use it.
Remove the support for reading the extent tree and add some `assert`s
to make sure it isn't still used somewhere.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The main gains here are:
* Move extent tree searches into BeesScanModeExtent so that they are
not slowed down by the BeesCrawl code, which was designed for the
much more specialized metadata in subvol trees.
* Enable short extent skipping now that BeesCrawl is out of the way.
* Stop enumerating btrfs subvols when in extent scan mode.
All this gets rid of >99% of unnecessary extent tree searches.
Incremental extent scan cycles now finish in milliseconds instead
of minutes.
BeesCrawl was never designed to cope with the structure and content of
the extent tree. It would waste thousands of tree-search ioctl calls
reading and ignoring metadata items.
Performance was particularly bad when a binary search was involved, as any
binary search probe that landed in a metadata block group would read and
discard all the metadata items in the block group, sequentially, repeated
for each level of the binary search. This was blocking implementation of
short extent skipping optimization for large extent size tiers, because
the skips were using thousands of tree searches to skip over only a few
hundred extent items.
Extent scan also had to read every extent item twice to do the
transid filtering, because BeesCrawl's interface discarded the relevant
information when it converted a `BtrfsTreeItem` into a `BeesFileRange`.
The cost of this extra fetch was negligible, but it could have been zero.
Fix this by:
* Copy the equivalent of `fetch_extents` from BeesCrawl into
`BeesScanModeExtent`, then give each of the extent scan crawlers its
own `BtrfsDataExtentTreeFetcher` instance. This enables extent tree
searches to avoid pure (non-mixed) metadata block groups. `BeesCrawl`
is now used only for its interface to `BeesRoots` for saving state in
`beescrawl.dat`, and never to determine the next extent tree item.
* Move subvol-specific parts of `BeesRoots` into a new class
`BeesScanModeSubvol` so that `BeesScanModeExtent` doesn't have to enable
or support them. In particular, `bees -m4` no longer enumerates all
of the _subvol_ crawlers. `BeesRoots` is still used to save and load
crawl state.
* Move several members from `BeesScanModeExtent` into a per-crawler
state object `SizeTier` to eliminate the need for some locks and to
maintain separate cache state for `BtrfsDataExtentTreeFetcher`.
* Reuse the `BtrfsTreeItem` to get the generation field for the transid
range filter.
* Avoid a few corner cases when handling errors, where extent scan might
drop an extent without scanning it, or fail to advance to the next extent.
* Enable the extent-skipping algorithm for large size tiers, now that
`BeesCrawl::fetch_extents` is no longer slowing it down.
* Add a debug stream interface which developers can easily turn on when
needed to inspect the decisions that extent scan is making.
* Track metrics that are more useful, particularly searches per extent
scanned, and fraction of extents that are skipped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This gets rid of one open-coded btrfs tree search.
Also reduce the log noise level for subvol open failures, and remove
some ancient references to `BEESLOG`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Rearrange the logic in `rlower_bound` so it can cope with a tree
that contains mostly block-aligned objects, with a few exceptions
filtered out by `hdr_stop`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In some cases functions already had existing debug stream support
which can be redirected to the new interface. In other cases, new
debug messages are added.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This allows plugging in an ostream at run time so that we can audit all
the search calls we are doing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BtrfsFsTreeFetcher was used for early versions of the extent scanner, but
neither subvol nor extent scan now needs an object that is both persistent
and configured to access only one subvol. BtrfsExtentDataFetcher does
the same thing in that case.
Clarify the comments on what the remaining classes do, so that
BtrfsFsTreeFetcher doesn't get inadvertently reinvented in the future.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Binary searches can be extremely slow if the target bytenr is near a
metadata block group, because metadata items are not visible to the
binary search algorithm. In a non-mixed-bg filesystem, there can be
hundreds of thousands of metadata items between data extent items, and
since the binary search algorithm can't see them, it will run searches
that iterate over hundreds of thousands of objects about a dozen times.
This is less of a problem for mixed-bg filesystems because the data and
metadata blocks are not isolated from each other. The binary search
algorithm still can't see the metadata items, but there are usually
some data items close by to prevent the linear item filter from running
too long.
Introduce a new fetcher class (all the good names were taken) that tracks
where the end of the current block group is. When the end of the current
block group is reached in the linear search, skip ahead to a block group
that can contain data items.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The cwd is where core dumps and various profiling and verification
libraries want to write their data, whereas root_fd is the root of the
target filesystem. These are often intentionally different. When
they are different, `--strip-paths` sets the wrong prefix to strip
from paths.
Once the root fd has been established, we can set the path prefix to
the string prefix that we'll get from future calls to `name_fd`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees explicitly supports storing $BEESHOME on another filesystem, and
does not require that filesystem to be btrfs; however, if $BEESHOME
is on a non-btrfs filesystem, there is an exception on every startup
when trying to identify the subvol root of the hash table file in order
to blacklist it, because non-btrfs filesystems don't have subvol roots.
Fix by checking not only whether $BEESHOME is on btrfs, but whether
it is on the _same_ btrfs as the bees root, without throwing an
exception.
The hash table is blacklisted only when both filesystems are btrfs and
have the same fsid.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Enable use of the ioctl to probe whether two fds refer to the same btrfs,
without throwing an exception.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The debug log is only revealed when something goes wrong, but it is
created and discarded every time `seek_backward` is called, and it
is quite CPU-intensive.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Remove dubious comments and #if 0 section. Document new event counters,
and add one for read failures.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
1% is a lot of data on a petabyte filesystem, and a long time to wait for an
ETA.
After 1 GiB we should have some idea of how fast we're reading the data.
Increase the time to 10 seconds to avoid a nonsense result just after a scan
starts.
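The new gate, sketched with hypothetical names:

    #include <cstdint>

    // Don't publish an ETA until enough data has been read (1 GiB) for
    // a meaningful rate estimate, and enough wall time (10 seconds) has
    // passed to avoid a nonsense result just after a scan starts.
    bool eta_is_meaningful(uint64_t bytes_done, double seconds_elapsed) {
        return bytes_done >= (uint64_t(1) << 30) && seconds_elapsed >= 10.0;
    }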
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* `nodev`: This reduces rename attack surface by preventing bees from
opening any device file on the target filesystem.
* `noexec`: This prevents access to the mount point from being leveraged
to execute setuid binaries, or execute anything at all through the
mount point.
These options are not required because they duplicate features in the
bees binary (assuming that the mount namespace remains private):
* `noatime`: bees always opens every file with `O_NOATIME`, making
this option redundant.
* `nosymfollow`: bees uses `openat2` on kernels 5.6 and later with
flags that prevent symlink attacks. `nosymfollow` was introduced in
kernel 5.10, so every kernel that can do `nosymfollow` can already do
`openat2`. Also, historically, `$BEESHOME` can be a relative path with
symlinks in any path component except the last one, and `nosymfollow`
doesn't allow that.
Between `openat2` and `nodev`, all symlink attacks are prevented, and
rename attacks cannot be used to force bees to open a device file.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We _recommend_ that `$BEESHOME` should be a subvol, and we'll create a
subvol if no directory exists; however, there's no reason to reject an
existing plain directory if the user chooses to use one.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When the beesd script is started without systemd, the mount point won't
automatically unmount if the script is cancelled with ctrl+c.
Fixes: https://github.com/Zygo/bees/issues/281
Signed-off-by: Kai Krakow <kai@kaishome.de>
There's a pathological case where all of the extent scan crawlers except
one are at the end of a crawl cycle, but the one crawler that is still
running is keeping the Task queue full. The result is that bees never
starts the other extent scan crawlers, because the queue is always
full at the instant a new transid triggers the start of a new scan.
That's bad because it will result in bees falling behind when new data
from the inactive size tiers appears.
To fix this, check for throttling _after_ creating at least one scan task
in each crawler. That will keep the crawlers running, and possibly allow
them to claw back some space in the Task queue. It slightly overcommits
the Task queue, so there will be a few more Tasks than nominally allowed.
Also (re)introduce some hysteresis in the queue size limit and reduce it
a little, so that bees isn't continually stopping and restarting crawls
every time one task is created or completed, and so that we stay under
the configured Task limit despite overcommitting.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extent workarounds are going away because the underlying kernel
bugs have been fixed. They are no longer worthy of spamming non-developer
logs.
INO_PATHS can return no paths if an inode has been deleted. It doesn't
need a log message at all, much less one at WARN level.
Dedupe failure can be INFO, the same level as dedupe itself, especially
since the "NO dedupe" message doesn't mention what was [not] deduped.
Inspired by Kai Krakow's "context: demote "abandoned toxic match" to
debug log level".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This log message creates an overwhelming number of messages in the system
journal, leading to write-back flushing storms under high activity. As
it is a work-around message, it is probably only useful to developers,
thus demote to debug level.
This fixes latency spikes in desktop usage after adding a lot of new
files, especially since systemd-journal starts to flush caches if it
sees memory pressure.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Tasks are not allowed to be queued more than once, but it is allowed
to queue a Task while it's already running, which means a Task can be
executed on two threads in parallel. Tasks detect this and handle it
by queueing the Task on its own post-exec queue. That in turn leads
to Workers which continually execute the same Task if that Task doesn't
create any new Tasks, while other Tasks sit on the Master queue waiting
for a Worker to dequeue them.
For idle Tasks, we don't want the Task to be rescheduled immediately.
We want the idle Task to execute again after every available Task on
both the main and idle queues has been executed.
Fix these by having each Task reschedule itself on the appropriate
queue when it finishes executing.
Priority-queued Tasks should be executed in priority order across not
just one Task's post-exec queue, but the entire local queue of the
TaskConsumer.
Fix this by moving the sort into either the TaskConsumer that receives
a post-exec queue, if there is one, or into the Task that is created
to insert the post-exec queue into a TaskConsumer when one becomes
available in the future.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
next_transid tasks don't respect queue selection very well, because
they effectively end up spinning in a loop until all other worker
threads become busy.
Back this out, and fix the priority handling in the Task library.
This reverts commit 58db4071de.
Tasks using non-priority FIFO dependency tracking can insert themselves
into their own queue, to run the Task again immediately after it exits.
For priority queues, this attempts to splice the post-exec queue into
itself, which doesn't seem like a good idea.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Suppose Task A, B, and C are created in that order, and currently running.
Task T acquires Exclusion E. Tasks B, A, and C attempt to acquire the
same Exclusion, in that order, but fail because Task T holds it.
The result is Task T with a post-exec queue:
T, [ B, A, C ] sort_requested
Now suppose Task U acquires Exclusion F, then Task T attempts to acquire
Exclusion F. Task T fails to acquire F, so T is inserted into U's
post-exec queue. The result at the end of the execution of T is a tree:
U, [ T ] sort_requested
\-> [ B, A, C ] sort_requested
Task T exits after failing to acquire a lock. When T exits, T will
sort its post-exec queue and submit the post-exec queue for execution
immediately:
Worker 1: U, [ T ] sort_requested
Worker 2: A, B, C
This isn't ideal because T, A, B, and C all depend on at least one
common Exclusion, so they are likely to immediately conflict with T
when U exits and T runs again.
Ideally, A, B, and C would at least remain in a common queue with T,
and ideally that queue is sorted.
Instead of inserting T into U's post-exec queue, insert T and all
of T's post-exec queue, which creates a single flattened Task list:
U, [ T, B, A, C ] sort_requested
Then when U exits, it will sort [ T, B, A, C ] into [ A, B, C, T ],
and run all of the queued Tasks in age priority order:
U exited, [ T, B, A, C ] sort_requested
U exited, [ A, B, C, T ]
[ A, B, C, T ] on TaskConsumer queue
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task started out as a self-organizing parallel-make algorithm, but ended
up becoming a half-broken wait-die algorithm. When a contended object
is already locked, Tasks enter a FIFO queue to restart and acquire the
lock. This is the "die" part of wait-die (all locks on an Exclusion are
non-blocking, so no Task ever does "wait"). The lock queue is FIFO wrt
_lock acquisition order_, not _Task age_ as required by the wait-die
algorithm.
Make it a 25%-broken wait-die algorithm by sorting the Tasks on lock
queues in order of Task ID, i.e. oldest-first, or FIFO wrt Task age.
This ensures the oldest Task waiting for an object is the one to get
it when it becomes available, as expected from the wait-die algorithm.
This should reduce the amount of time Tasks spend on the execution queue,
and reduce memory usage by avoiding the accumulation of Tasks that cannot
make forward progress.
Note that turning `TaskQueue` into an ordered container would have
undesirable side-effects:
* `std::list` has some useful properties wrt stability of object
location and cost of splicing. Other containers may not have these,
and `std::list` does have a `sort` method.
* Some Task objects are created at the beginning and reused continually,
but we really do want those Tasks to be executed in FIFO order wrt
submission, not Task ID. We can exclude these tasks by only doing the
sorting when a Task is queued for an Exclusion object.
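A sketch of the sort, assuming the queue stays a `std::list` of
task-state pointers with monotonically increasing numeric IDs:

    #include <list>
    #include <memory>

    struct TaskStateSketch {
        unsigned long id;       // monotonically increasing at creation
    };
    using TaskQueueSketch = std::list<std::shared_ptr<TaskStateSketch>>;

    // Oldest Task (lowest ID) first, i.e. FIFO wrt Task age, as the
    // wait-die algorithm expects. std::list::sort splices nodes, so
    // the queued Tasks are not copied.
    void sort_for_wait_die(TaskQueueSketch &q) {
        q.sort([](const auto &a, const auto &b) {
            return a->id < b->id;
        });
    }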
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Emphasize that the option is relevant to old kernels, older than the
minimum supportable version threshold.
De-emphasize the use case of "send-workaround" as a synonym for "exclude
read-only".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
One of the more obvious ways to reduce bees load is to simply not run
it all the time. Explicitly state using maintenance windows as a load
management option.
SIGUSR1 and SIGUSR2 should have been documented somewhere else before now.
Better late than never.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The theories behind bees slowing down when presented with a larger hash
table turned out to be wrong. The real cause was a very old bug which
submitted thousands of `LOGICAL_INO` requests when only a handful of
requests were needed.
"Compression on the filesystem" -> "Compression in files"
Don't be so "dramatic". Be "rapid" instead.
Remove "cannot avoid modifying read-only snapshots" as a distinction
between subvol and extent scans. Both modes support send workaround
and send waiting with no significant distinction.
Emphasize extent scan's better handling of many snapshots. Also reflinks.
Add some discussion of `--throttle-factor`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Thread names have changed. Document some of the newer ones.
Don't jump immediately to blaming poor performance on qgroups or
autodefrag. These do sometimes have kernel regressions but not all
the time.
Emphasize advantage of controlling bees deferred work requests at the
source, before btrfs gets stuck committing them.
Avoid asserting that it's OK for gdb to crash.
Remove mention of lower-layer block device issues wrt corruption.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"Kernel" -> "Linux kernel". If you can run bees on a kernel that isn't
Linux, congratulations!
Emphasize the age of the data corruption warnings. Once 5.4 reaches
EOL we can remove those.
Simplify the discussion of old kernels and API levels. There's a
new optional kernel API for `openat2` support at 5.6. The absolute
minimum kernel version is still 4.2, and will not increase to 4.15
until the subvol scanners are removed.
Remove discussion of bees support for kernels 4.19 (which recently
reached EOL) and earlier.
The `LOGICAL_INO` vs dedupe bug is actually a `LOGICAL_INO` vs clone bug.
Dedupe isn't necessary to reproduce it.
Remove a stray ')'.
Strip out most of the discussion of slow backrefs, as they are no longer a
concern on the range of supported kernel versions. Leave some description
there because bees still has some vestigial workarounds.
Remove `btrfs send` from the "Unfixed kernel bugs" section, which makes
the section empty, so remove the section too. bees now handles send on
a subvol reasonably well.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Emphasize "large" is an upper bound on the size of filesystem bees
can handle.
New strengths: largest extent first for fixed maintenance windows,
scans data only once (ish), recovers more space
Removed weaknesses: less temporary space
Need more caps than `CAP_SYS_ADMIN`.
Emphasize DATA CORRUPTION WARNING is an old-kernel thing.
Update copyright year.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tested on larger filesystems than 100T too, but let's use Fermi
approximation. Next size is 1P.
Removed interaction with block-level SSD caching subsystems. These are
really btrfs metadata vs. a lower block layer, and have nothing to do
with bees.
Added mixed block groups to the tested list, as mixed block groups
required explicit support in the extent scanner.
Added btrfs-convert to the tested list. btrfs-convert has various
problems with space allocation in general, but these can be solved by
carefully ordered balances after conversion, and they have nothing to
do with bees.
In-kernel dedupe is dead and the stubs were removed years ago. Remove it
from the list.
btrfs send now plays nicely with bees on all supportable kernels, now
that stable/linux-4.19.y is dead. Send workaround is only needed for
kernels before v5.4 (technically v5.2, but nobody should ever mount a
btrfs with kernel v5.1 to v5.3). bees will pause automatically when
deduping a subvol that is currently running a send.
bees will no longer gratuitously refragment data that was defragmented
by autodefrag.
Explicitly list all the RAID profiles tested so far, as there have been
some new ones.
Explicitly list other deduplicators tested.
Sort the list of btrfs features alphabetically.
Add scrub and balance, which have been tested with bees since the
beginning.
New tested btrfs features: block-group-tree, raid1c3, raid1c4.
New untested btrfs features: squotas, raid-stripe-tree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This records the time when the progress data was calculated, to help
indicate when the data might be very old.
While we're here, move "now" out of the loop so there's only one value.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This increases resistance to symlink and mount attacks.
Previously, bees could follow a symlink or a mount point in a directory
component of a subvol or file name. Once the file is opened, the open
file descriptor would be checked to see if its subvol and inode matches
the expected file in the target filesystem. Files that fail to match
would be immediately closed.
With openat2 resolve flags, symlinks and mount points terminate path
resolution in the kernel. Paths that lead through symlinks or onto
mount points cannot be opened at all.
Fall back to openat() if openat2() returns ENOSYS, so bees will still
run on kernels before v5.6.
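A sketch of the fallback path, using the raw syscall since libc may
not provide a wrapper:

    #include <linux/openat2.h>  // struct open_how, RESOLVE_* flags
    #include <sys/syscall.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>

    // Resolve without following symlinks or crossing mount points;
    // fall back to plain openat() on kernels before v5.6.
    int open_noxdev(int dirfd, const char *path) {
        struct open_how how = {};
        how.flags = O_RDONLY;
        how.resolve = RESOLVE_NO_SYMLINKS | RESOLVE_NO_XDEV;
        long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
        if (fd >= 0 || errno != ENOSYS)
            return static_cast<int>(fd);
        return openat(dirfd, path, O_RDONLY);
    }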
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we're now using weak symbols for dodgy libc functions, we might
as well do it for gettid() too.
Use the ::gettid() global namespace and let libc override it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
openat2 allows closing more TOCTOU holes, but we can only use it when
the kernel supports it.
This should disappear seamlessly when libc implements the function.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"ctime", an abbreviation of "cycle time", collides with "ctime", an
abbreviation of "st_ctime", a well-known filesystem term.
"tm_left" fits in the column, so use that.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* Report position within cycle in units that cannot be mistaken for size or percentage
* Put the total/maximum values in their own row
* Add a start time column
* Change column titles to reference "cycles"
* Use "idle" instead of "finished" when a crawler is not running
* Replace "transid" with "gen" because it's shorter
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The scanners which finish early can become stuck behind scanners that are
able to keep the queue full. Switch the next_transid task to the normal
Task queues so that we force scanners to restart on every new transaction,
possibly deferring already queued work to do so.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add yet another field to the scan/skip report line: the wallclock
time used to process the extent ref.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The total data size should not include metadata or system block groups,
and already does not; however, we still have these block groups in the map
for mapping the crawl pointer to a logical offset within the filesystem.
Rearrange a few lines around the `if` statement so that the map doesn't
contain anything it should not.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The progress indicator was failing on a mixed-bg filesystem because those
filesystems have block groups which have both _DATA and _METADATA bits,
and the filesystem size calculation was excluding block groups that have
_METADATA set. It should exclude block groups that have _DATA not set.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Running bees with no arguments complains about "Only one" path argument.
Replace this with "Exactly one" which uses similar terminology to other
btrfs tools.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
`getopt_long` already supplies a message when an option cannot be parsed,
so there isn't a need to distinguish option parse failures from help
requests.
Fixes: https://github.com/Zygo/bees/pull/277
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Longer latency testing runs are not showing a consistent gain from a
throttle factor of 1.0. Make the default more conservative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Decaying averages by 10% every 5 minutes gives roughly a half-hour
half-life to the rolling average. Speed that up to once per minute.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We're not adding any more short options, but the debugging code doesn't
work with optvals above 255. Also clean up constness and variable
lifetimes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Measure the time spent running various operations that extend btrfs
transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe)
and arrange for each operation to run for not less than the average
amount of time by adding a sleep after each operation that takes less
than the average.
The delay after each operation is intended to slow down the rate of
deferred and long-running requests from bees to match the rate at which
btrfs is actually completing them. This may help avoid big spikes in
latency if btrfs has so many requests queued that it has to force a
commit to release memory.
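The core of the throttle, sketched with hypothetical names (the real
code also applies a configurable factor, see `--throttle-factor`):

    #include <chrono>
    #include <thread>

    // After an operation that extends btrfs transaction commit times
    // (LOGICAL_INO, tmpfile create, dedupe), sleep for the remainder
    // of the average completion time, so bees issues such requests no
    // faster than btrfs retires them.
    void throttle_after_op(double op_seconds, double avg_seconds) {
        if (op_seconds < avg_seconds) {
            std::this_thread::sleep_for(std::chrono::duration<double>(
                avg_seconds - op_seconds));
        }
    }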
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Test machines keep blowing past the 32k file limit. 16 worker
threads at 10,000 files each is much larger than 32k.
Other high-FD-count services like DNS servers ask for million-file
rlimits.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
While a snapshot is being deleted, there will be a continuous stream of
"No ref for extent" messages. This is a common event that does not need
to be reported.
There is an analogous situation when a call to open() fails with ENOENT.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Dedupe is not possible on a subvol where a btrfs send is running:
BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress)
btrfs informs a process with EAGAIN that a dedupe could not be performed
due to a running send operation.
It would be possible to save the crawler state at the affected point,
fork a new crawler that avoids the subvol under send, and resume the
crawler state after a successful dedupe is detected; however, this only
helps the intersection of the set of users who have unrelated subvols
that don't share extents, and the set of users who cannot simply delay
dedupe until send is finished. The simplest approach is to simply stop
and wait until the send goes away.
The simplest approach is taken here. When a dedupe fails with EAGAIN,
affected Tasks will poll, approximately once per transaction, until the
dedupe succeeds or fails with a different error.
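The approach, sketched with the plain FIDEDUPERANGE ioctl and a
hypothetical wait_for_new_transid callback; the bees dedupe path also
inspects per-range status fields:

    #include <linux/fs.h>       // FIDEDUPERANGE, struct file_dedupe_range
    #include <sys/ioctl.h>
    #include <cerrno>
    #include <functional>

    // Retry the dedupe roughly once per transaction while the kernel
    // reports EAGAIN because a send is using the subvol.
    int dedupe_wait_for_send(int src_fd, file_dedupe_range *args,
                             const std::function<void()> &wait_for_new_transid) {
        for (;;) {
            if (ioctl(src_fd, FIDEDUPERANGE, args) == 0)
                return 0;
            if (errno != EAGAIN)
                return -1;          // a different error: give up
            wait_for_new_transid(); // poll ~once per transaction
        }
    }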
bees dedupe performance corresponds with the availability of subvols that
can accept dedupe requests. While the dedupe is paused, no new Tasks can
be performed by the worker thread. If subvols are small and isolated
from the bulk of the filesystem data, the result will be a small but
partial loss of dedupe performance during the send as some worker threads
get stuck on the sending subvol. If subvols heavily share extents with
duplicate data in other subvols, worker threads will all become blocked,
and the entire bees process will pause until at least some of the running
sends terminate.
During the polling for btrfs send, the dedupe Task will hold its dst
file open. This open FD won't interfere with snapshot or file delete
because send subvols are always read-only (it is not possible to delete
a file on a RO subvol, open or otherwise) and send itself holds the
affected subvol open, preventing its deletion. Once the send terminates,
the dedupe will terminate soon after, and the normal FD release can occur.
This pausing during btrfs send is unrelated to the
`--workaround-btrfs-send` option, although `--workaround-btrfs-send` will
cause the pausing to trigger less often. It applies to all scan modes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are no callers of this method any more, and it exposes more
of BeesRoots than we really want things to have access to.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
All callers of the `transid_max_nocache` method update `m_transid_re`
with the return value, so do that in `transid_max_nocache` itself.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* Allow RateLimiter to change rate after construction.
* Check range of rate argument in constructor.
* Atomic increment for RateEstimator.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The "done" pointer and the "%done" fields are still useful because they
indicate _actual_ progress, not the work that has been _promised_.
So it is possible for a crawl to be "finished" (all extents queued)
but not "100.0000%" (some of those extents still active or in the queue).
"deferred" state isn't particularly useful, so drop it.
"finished" state implies no ETA, so that column is unused.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
ETA is calculated using a sample obtained by snooping on bees's normal
crawling operations.
This sample is heavily biased and not representative of the entire
filesystem. If the distribution of extent sizes in the filesystem is
not uniform, the ETA can be wildly wrong.
Collecting an accurate sample set would require extra IO and CPU time
which should be spent doing dedupes instead.
Explicitly label the ETA as inaccurate to avoid having too many users
report the same bug.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees might be unpaused at any time, so make sure that the dynamic load
calculation is ready with a non-zero thread count.
This avoids a delay of up to 5 seconds when responding to SIGUSR2
when loadavg tracking is enabled.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These are simple on/off switches for the task queue. They are lightweight
requests for bees to be paused temporarily, but allow bees to release
open files and save progress while paused.
These signals are an alternative to SIGSTOP and SIGCONT, or using the
cgroup freezer's FROZEN and THAWED states, which pause and resume the
bees process, but do not allow the bees process to release open files
or save progress. Snapshot and file deletes can occur on the filesystem
while bees is paused by SIGUSR1 but not by SIGSTOP.
These signals are also an alternative to SIGTERM and restart, which
flush out the whole hash table and progress state on exit, and read
the whole table back into memory on restart.
This feature is experimental and may be replaced by a more general
configuration or runtime control mechanism in the future.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When paused, TaskConsumer threads will eventually notice the paused
condition and exit; however, there's nothing to restart threads when
exiting the paused state.
When unpausing, and while the lock is already held, create TaskConsumer
threads as needed to reach the target thread count.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit 72c3bf8438 ("fs: handle ENOENT
within lib") was meant to prevent exceptions when a subvol is deleted.
If the search ioctl fails, the kernel won't set nr_items in the
ioctl output, which means `nr_items` still has the input value. When
ENOENT is detected, `this->nr_items` is set to 0, then later `*this =
ioctl_ptr->key` overwrites `this->nr_items` with the original requested
number of items.
This replaced the ENOENT exception with an exception triggered by
interpreting garbage in the memory buffer. The number of exceptions
was reduced because the memory buffers are frequently reused, but upper
layers would then reject the data or ignore it because it didn't match
the key range.
Fix by setting `ioctl_ptr->key.nr_items`, which then overwrites
`this->nr_items`, so the loop that extracts items from the ioctl data
gets the right number of items (i.e. zero).
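A sketch of the fix using the raw btrfs ioctl types; the bees code
wraps these in its own structures:

    #include <linux/btrfs.h>
    #include <sys/ioctl.h>
    #include <cerrno>

    // On ENOENT (e.g. the subvol was deleted), zero the item count in
    // the ioctl buffer itself: the later `*this = ioctl_ptr->key`
    // assignment would overwrite a direct fix to this->nr_items.
    bool search_nothrow(int fd, btrfs_ioctl_search_args_v2 *ioctl_ptr) {
        if (ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_ptr) == 0)
            return true;
        if (errno == ENOENT) {
            ioctl_ptr->key.nr_items = 0;    // extract zero items
            return true;
        }
        return false;
    }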
Fixes: 72c3bf8438 ("fs: handle ENOENT within lib")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In some cases the offset and size arguments were flipped when checking to
see if a range had already been read. This would have been OK as long as
the same mistake had been made consistently, since `bees_readahead_check`
only does a cache lookup on the parameters, it doesn't try to use them to
read a file. Alas, there was one case where the correct order was used,
albeit a relatively rare one.
Fix all the calls to use the correct order.
Also fix a comment: the recent request cache is global to all threads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
hexdump was moved into a template in its own header years ago, but
the declaration of the implementation that used to be in fs.cc remains.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
hexdump processes a vector as a contiguous sequence of bytes, regardless
of V's value type, so hexdump should get a pointer and use uint8_t to
read the data.
Some vector types have a lock and some atomics in their operator[], so
let's avoid hammering those.
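A sketch of the revised template, assuming V exposes contiguous
storage via `data()` and `size()`:

    #include <cstddef>
    #include <cstdint>
    #include <iomanip>
    #include <ostream>

    // Read the vector's storage once through a raw uint8_t pointer
    // instead of V::operator[], which may take a lock or touch atomics
    // in some vector types.
    template <class V>
    void hexdump(std::ostream &os, const V &v) {
        const uint8_t *p = reinterpret_cast<const uint8_t *>(v.data());
        const size_t bytes = v.size() * sizeof(*v.data());
        for (size_t i = 0; i < bytes; ++i) {
            os << std::hex << std::setw(2) << std::setfill('0')
               << unsigned(p[i]) << ((i % 16 == 15) ? '\n' : ' ');
        }
        os << std::dec << '\n';
    }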
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
operator<< was a friend that locked the ByteVector, then invoked
hexdump on the ByteVector, which used ByteVector::operator[]...which
locked the ByteVector again, resulting in a deadlock.
operator<< shouldn't be a friend anyway. Make hexdump use the
normal public access methods of ByteVector.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Although all the members of BtrfsExtentDataFetcher are theoretically
copiable, there's no need to actually make any such copy.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make it clearer where the progress information goes.
Also add placeholder text so the progress section isn't empty at startup,
when the progress hasn't been calculated yet.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extents are mostly gone in kernel 5.7 and later. Increase the
timeout for toxic extent handling to reduce false positives, and remove
persistently stored toxic hashes from the hash table.
Toxic hashes are still stored nonpersistently to help mitigate problems
due to any remaining kernel bugs.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The serialization doesn't seem to be necessary for the extent scan mode.
No infinite loops in the kernel have been observed in the past two years,
despite never having used MultiLock for the extent scanner.
Leave the serialization for now on the subvol scanners.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The extent scan mode has been implemented (partially, but close enough
to win benchmarks).
New features include several nuisance dedupe countermeasures.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Extent is a different kind of scan mode, so introduce the concept of
the two kinds of scan mode, and rearrange the description of scan modes
along the new boundaries.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need the subvol numbers since they're only interesting to
developers.
We don't need both max and min sizes, pick one and drop the other.
Replace "16E" with "max"--it is the same number of characters, but
doesn't require the user to know what 1<<64 is off the top of their head.
Shorten "remain" to "todo" because sometimes those extra two columns
matter.
Drop the seconds field in ETA timestamps. Long scan arrival times are
years away, and short scan arrival times are only updated once every
5 minutes, so the extra precision isn't useful.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the progress information more accessible, without having to
enable full debug log and fish it out of the stream with grep.
Also increase the progress log level to INFO.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are two crawl_maps in extent scan's next_transid: one gets
initialized, the other gets used. This works OK as long as bees is
resuming an existing scan, because the two maps are identical; however,
it fails if bees is starting without an existing set of crawl data,
and one of the two maps is empty or partially filled.
The failure is intermittent, as the crawl map is being populated at
the same time next_transid runs. It will eventually be completed after
several transaction cycles, at which point bees runs normally.
It does add significant delays during startup for benchmarks.
There's only one crawl_map in extent scan; it always has the same
crawlers, and extent scan's `next_transid` creates it by itself.
Ignore the map from BeesRoots/BeesCrawl.
Also throw in some missing but helpful trace statements.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Replace pointers in the "done" and "total" columns with estimated data
sizes for each size tier. The estimation is based on statistics
collected from extents scanned during the current bees run.
Move the total size for the entire filesystem up to the heading.
Report the _completed_ position (i.e. the one that would be saved in
`beescrawl.dat`), not the _queued_ position (i.e. the one where the
next Task would be created in memory).
At the end of the data, the crawl pointer ends up at some random point
in the filesystem just after the newest extent, so the progress gets to
99.7% and then goes to some random value like 47% or 3%, not to 100%.
Report "deferred" in the "done" column when the crawler is waiting for
the next transid, and "finished" in the "%done" column when the crawler
has reached the end of the data. Suppress the ETA when finished. This
makes it clear that there's no further work to do for these crawlers.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesScanModeExtent uses six scan Tasks instead of one, which leads
to awkwardness like the do_scan method telling crawl_roots how to do
what it shouldn't need to know how to do anyway.
Move the crawl_roots logic into the ::scan methods themselves.
This also deletes the very popular "crawl_more ran out of data" message.
Extent scan explicitly indicates when a scan is complete, so there's
no longer a need to fish this message out of the log.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The sorting avoids problematic read orders, like extent refs in the same
inode with descending offsets, that btrfs is not optimized for.
Putting everything in one Task keeps the queue sizes small, and
manages the lock contention much more calmly.
We only want to be mapping extent refs if there's not enough extents
already in the queue to keep worker threads busy, so use the `idle()`
method instead of `run()`.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The EXTENT scan mode reads the extent tree, splits it into tiers by
extent size, converts each tier's extents into subvol/inode/offset refs,
then runs the legacy bees dedupe engine on the refs.
The extent scan mode can cheaply compute completion percentage and ETA,
so do that every time a new transid is observed.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a second level queue which is only serviced when the local and global
queues are empty.
At some point there might be a need to implement a full priority queue,
but for now two classes are sufficient.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This should help clean up some of the uglier status outputs.
Supports:
* multi-line table cells
* character fills
* sparse tables
* insert, delete by row and column
* vertical separators
and not much else.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We can no longer reliably determine the number of hash table matches,
since we'll stop counting after the first one.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We were doing a `LOGICAL_INO` ioctl on every _block_ of a matching extent,
just to see how long it takes. It takes a while!
This could be modified to do an ioctl with the `IGNORE_OFFSET` flag,
once per new extent, but the kernel bug was fixed a long time ago, so
we can start removing all the toxic extent code.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When we have multiple possible matches for a block, we proceed in three
phases:
1. retrieve each match's extent refs and put them in a list,
2. iterate over the list converting viable block matches into range matches,
3. sort and flatten the list of range matches into a non-overlapping
list of ranges that cover all duplicate blocks exactly once.
The separation of phase 1 and 2 creates a performance issue when there
are many block matches in phase 1, and all the range matches in phase
2 are the same length. Even though we might quickly find the longest
possible matching range early in phase 2, we first extract all of the
extent refs from every possible matching block in phase 1, even though
most of those refs will never be used.
Fix this by moving the extent ref retrieval in phase 1 into a single
loop in phase 2, and stop looping over matching blocks as soon as any
dedupe range is created. This avoids iterating over a large list of
blocks with expensive `LOGICAL_INO` ioctls in an attempt to improve the
match when there is no hope of improvement, e.g. when all match ranges
are 4K and the content is extremely prevalent in the data.
If we find a matched block that is part of a short matching range,
we can replace it with a block that is part of a long matching range,
because there is a good chance we will find a matching hash block in
the long range by looking up hashes after the end of the short range.
In that case, overlapping dedupe ranges covering both blocks in the
target extent will be inserted into the dedupe list, and the longest
matches will be selected at phase 3. This usually provides a similar
result to that of the loop in phase 1, but _much_ more efficiently.
Some operations are left in phase 1, but they are all using internal
functions, not ioctls.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A laundry list of problems fixed:
* Track which physical blocks have been read recently without making
any changes, and don't read them again.
* Separate dedupe, split, and hole-punching operations into distinct
planning and execution phases.
* Keep the longest dedupe from overlapping dedupe matches, and flatten
them into non-overlapping operations.
* Don't scan extents that have blocks already in the hash table.
We can't (yet) touch such an extent without making unreachable space.
Let them go.
* Give better information in the scan summary visualization: show dedupe
range start and end points (<ddd>), matching blocks (=), copy blocks
(+), zero blocks (0), inserted blocks (.), unresolved match blocks
(M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i),
and there's-a-bug-we-didn't-do-this-one blocks (#).
* Drop cached data from extents that have been inserted into the hash
table without modification.
* Rewrite the hole punching for uncompressed extents, which apparently
hasn't worked properly since the beginning.
Nuisance dedupe elimination:
* Don't do more than 100 dedupe, copy, or hole-punch operations per
extent ref.
* Don't split an extent or punch a hole unless dedupe would save at
least half of the extent ref's size.
* Write a "skip:" summary showing the planned work when nuisance
dedupe elimination decides to skip an extent.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a master switch to turn off the entire MultiLock infrastructure for
testing, without having to remove and add all the individual entry points.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This prevents the storms of exceptions that occur when a subvol is
deleted. We simply treat the entire tree as if it was empty.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
try_lock allows specification of a different Task to be run instead of
the current Task when the lock is busy.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Originally the limit was 2730 (64KiB worth of ref pointers). This limit
was a little too low for some common workloads, so it was then raised by
a factor of 256 to 699050, but there are a lot of problems with extent
counts that large. Most of those problems are memory usage and speed
problems, but some of them trigger subtle kernel MM issues.
699050 references is too many to be practical. Set the limit to 9999,
only 3-4x larger than the original 2730, to give up on deduplication
when each deduped ref reduces the amount of space by no more than 0.01%.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves a third bad problem with bees reads:
3. The architecture above the read operations will issue read requests
for the same physical blocks over and over in a short period of time.
Fixing that properly requires rewriting the upper-level code, but a
simple small table of recent read requests can reduce the effect of the
problem by orders of magnitude.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This solves some of the worst problems with bees reads:
1. The kernel readahead doesn't work. More precisely, it's much better
adapted for a very different use case: a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead. The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads get
preempted by the random reads, or the kernel cancels the readaheads
because the data access pattern isn't sequential after the readahead
was issued.
2. Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces. At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile. In all other cases,
the elaborate calculations always return the same result: there can be
only one reader at a time.
This commit fixes both problems:
1. Don't use the kernel readahead. Use normal reads into a dummy
buffer instead.
2. Allow only one thread to readahead at any time. Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.
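A minimal sketch of the scheme (helper names and chunk size are
assumptions, not the actual bees code):

#include <algorithm>
#include <mutex>
#include <unistd.h>
#include <vector>

// Only one thread performs "readahead" at a time; everyone else's
// random-order small reads then hit the page cache instead of the disk.
static std::mutex s_readahead_mutex;

// Read into a dummy buffer instead of calling readahead(2), which the
// kernel schedules at idle ioprio or cancels outright.
static void readahead_nolock(int fd, off_t offset, size_t size)
{
	std::vector<char> dummy(128 * 1024);
	while (size > 0) {
		const size_t chunk = std::min(size, dummy.size());
		const ssize_t rv = pread(fd, dummy.data(), chunk, offset);
		if (rv <= 0) break;	// EOF or error: stop quietly
		offset += rv;
		size -= rv;
	}
}

static void readahead_sketch(int fd, off_t offset, size_t size)
{
	std::unique_lock<std::mutex> lock(s_readahead_mutex);
	readahead_nolock(fd, offset, size);
}

// The _pair variant holds the one lock across both reads.
static void readahead_pair_sketch(int fd1, off_t off1, size_t sz1,
				  int fd2, off_t off2, size_t sz2)
{
	std::unique_lock<std::mutex> lock(s_readahead_mutex);
	readahead_nolock(fd1, off1, sz1);
	readahead_nolock(fd2, off2, sz2);
}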
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table is read sequentially and from a single thread, so
the kernel's implementation of readahead is appropriate here.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit c3b664fea5 ("context: don't forget
to retry locked extents") removed the critical return that prevents a
Task from processing an extent that is locked.
Put the return back.
Fixes: c3b664fea5 ("context: don't forget to retry locked extents")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The kernel has not enforced a 16 MiB limit on dedupe requests since
v4.18-rc1 b67287682688 ("Btrfs: dedupe_file_range ioctl: remove 16MiB
restriction").
Kernels before v4.18 would truncate the request and return the size
actually deduped in `bytes_deduped`. Kernel v4.18 and later will loop
in the kernel until the entire request is satisfied (although still
in 16 MiB chunks, so larger extents will be split).
Modify the loop in userspace to measure the size the kernel actually
deduped, instead of assuming the kernel will only accept 16 MiB.
On current kernels this will always loop exactly once.
Since we now rely on `bytes_deduped`, make sure it has a sane value.
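For illustration, a sketch of such a loop over the generic FIDEDUPERANGE
interface (simplified error handling and a hypothetical helper name, not
the actual bees dedupe path):

#include <linux/fs.h>
#include <sys/ioctl.h>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Advance by whatever the kernel reports in bytes_deduped instead of
// assuming a fixed 16 MiB step. On current kernels the loop runs once.
static void dedupe_loop(int src_fd, uint64_t src_off,
			int dst_fd, uint64_t dst_off, uint64_t len)
{
	while (len > 0) {
		// args struct plus one dest info, zero-filled by the vector
		std::vector<uint8_t> buf(sizeof(file_dedupe_range) +
					 sizeof(file_dedupe_range_info));
		auto *args = reinterpret_cast<file_dedupe_range *>(buf.data());
		args->src_offset = src_off;
		args->src_length = len;
		args->dest_count = 1;
		args->info[0].dest_fd = dst_fd;
		args->info[0].dest_offset = dst_off;
		if (ioctl(src_fd, FIDEDUPERANGE, args) < 0)
			throw std::runtime_error("FIDEDUPERANGE");
		if (args->info[0].status != FILE_DEDUPE_RANGE_SAME)
			throw std::runtime_error("data differs or error");
		const uint64_t done = args->info[0].bytes_deduped;
		if (done == 0) break;	// sanity check on bytes_deduped
		src_off += done;
		dst_off += done;
		len -= done;
	}
}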
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These were added to crucible all the way back in 2018 (1beb61fb78
"crucible: error: record location of exception in what() message")
but they're even more useful in the stack tracer in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent
references by definition, it follows that we should not allocate buffer
space for them when we perform the LOGICAL_INO ioctl.
There is some evidence (particularly
https://github.com/Zygo/bees/issues/260#issuecomment-1627598058) that
the kernel is subjecting the page cache to a lot of disruption when
trying to allocate large buffers for LOGICAL_INO.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apparently reinterpret_cast<uint64_t> sign-extends 32-bit pointers.
This is OK when running on a 32-bit kernel that will truncate the pointer
to 32 bits, but when running on a 64-bit kernel, the extra bits are
interpreted as part of the (now very invalid) address.
Use <uintptr_t> instead, which is unsigned, integer, and the same word
size as the arch's pointer type. Ordinary numeric conversion can take
it from there, filling the rest of the word with zeros.
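A one-line sketch of the fix (hypothetical helper name):

#include <cstdint>

// reinterpret_cast<uint64_t>(p) sign-extends a 32-bit pointer into the
// upper word; uintptr_t is unsigned and pointer-sized, so the ordinary
// numeric conversion to uint64_t fills the upper word with zeros.
static uint64_t ptr_to_u64(const void *p)
{
	return reinterpret_cast<uintptr_t>(p);
}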
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The bug is:
v6.3-rc6: f349b15e183d mm: vmalloc: avoid warn_alloc noise caused by fatal signal
The fixes are:
v6.4: 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
v6.3.10: c189994b5dd3 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
The bug has been backported to LTS, but the fix has not:
v6.2.11: 61334bc29781 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
v6.1.24: ef6bd8f64ce0 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
v5.15.107: a184df0de132 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There was a bug in kernel 6.3 where LOGICAL_INO with IGNORE_OFFSET
sometimes fails to ignore the offset. That bug is now fixed, but
LOGICAL_INO still returns 0 refs much more often than seems appropriate.
This is most likely because bees frequently deletes extents while there
is still work waiting for them in Task queues. In this case, LOGICAL_INO
correctly returns an empty list, because every reference to some extent
is deleted, but the new extent tree with that extent removed is not yet
committed in btrfs.
Add a DEBUG-level log message and an event counter to track these events.
In the absence of a kernel bug, the debug message may indicate CPU time
was wasted performing a search whose outcome could have been predicted.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Toxic extents are much less of a problem now than they were in kernels
before 5.7. Downgrade the log message level to reflect their lesser
importance.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The critical kernel bugs in send have been fixed for years.
The limitations that remain aren't bugs, and bees has no sustainable
workaround for them.
Also update copyright year range.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We check the result of transid_max_nocache(), but not the result of
transid_max(). The latter is a computed result that is even more likely
to be wrong[citation needed].
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
At least one user was significantly confused by "designed for large
filesystems".
The btrfs send workarounds aren't new any more.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Clarify that "too large" and "too small" are some distance away from each other.
The Goldilocks zone is _wide_.
The interval between cache drops is now shorter.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Each object contains a 16 MiB buffer, which is very heavy for some
malloc implementations.
Keep the objects in a Pool so that their buffers are only allocated and
deallocated once in the process lifetime.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some malloc implementations will try to mmap() and munmap() large buffers
every time they are used, causing a severe loss of performance.
Nothing ever overrode the virtual methods, and there was no virtual
destructor, so they cause compiler warnings at build time when used with
a template that tries to delete pointers to them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
ProgressTracker was only freeing memory for work items when they reach
the head of the work tracking queue. If the first work item takes
hours to complete, and thousands of items are processed every second,
this leads to millions of completed items tracked in memory at a time,
wasting gigabytes of system RAM.
Rewrite ProgressHolderState methods to keep only incomplete work items
in memory, regardless of the order in which they are added or removed.
Also fix the unit tests which were relying on the memory leak to work,
and add test cases for code coverage.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If the send workaround is enabled, it is possible for two threads (a
thread running the crawl_new task, and a thread attempting to apply the
send workaround) to access the same RootFetcher object at the same time.
That never ends well.
Give each function its own BtrfsRootFetcher object.
Fixes: https://github.com/Zygo/bees/issues/250
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With SIGTERM and fast exit, the trickle writeback is less important.
We don't want to flood people's IO subsystems with continuous writes.
This really should be configurable at runtime.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Do rebuild bees-version.cc if libcrucible changes.
Don't rebuild bees-version.cc if it doesn't change.
Also use the standard suffix for new files.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
crucible::VERSION doesn't make much sense now that libcrucible no
longer exists as a shared library. Nothing ever referenced it, so
it can go away.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
According to ioctl_iflags(2):
The type of the argument given to the FS_IOC_GETFLAGS and
FS_IOC_SETFLAGS operations is int *, notwithstanding the
implication in the kernel source file include/uapi/linux/fs.h
that the argument is long *.
So this code doesn't work on be64 machines.
Also, Valgrind complains about it.
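For reference, a sketch of the correct usage (hypothetical helper):

#include <linux/fs.h>
#include <sys/ioctl.h>

// The argument type is int*, not long*. A long works by accident on
// 64-bit little-endian, where the kernel's 4 bytes land in the low
// half of the word; on big-endian 64-bit they land in the high half,
// and either way Valgrind sees the rest of the long as uninitialized.
static int get_iflags(int fd)
{
	int flags = 0;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		return -1;
	return flags;
}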
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A subtle distinction, and not one that is particularly relevant to bees,
but it does make toolchains complain.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Another instance of the pattern where we derived a crucible class
from a btrfs struct. Make it an automatic variable instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This was fixed in
7f660f50b lib: fs: stop using libbtrfs-dev helper functions to re-enable buffer length checks
but apparently some copies live on.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
These tools are obsolete. fiemap was a thin wrapper around FIEMAP,
but FIEMAP is not useful on btrfs. fiewalk was a thin wrapper around
BtrfsExtentWalker, but development on BtrfsExtentWalker has been
abandoned.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When a hash table write fails, we skip over the write throttling because
we didn't report that we successfully wrote an extent. This can be bad
if the filesystem is full and the allocations for writes are burning a
lot of CPU time searching for free space.
We also don't retry the write later on since we assume the extent is
clean after a write attempt whether it was successful or not, so the
extent might not be written out later when writes are possible again.
Check whether a hash extent is dirty, and always throttle after
attempting the write.
If a write fails, leave the extent dirty so we attempt to write it out
the next time flush cycles through the hash table. During shutdown
this will reattempt each failing write once, after that the updated hash
table data will be dropped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Calling 'bees -m4' should not call 'std::terminate()', but it does.
Use catch_all instead. It will still pass the exit value to return
from main.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We only use BtrfsExtentInfo when it's exactly equivalent to the
base, so drop the derived class.
While we're here, fix BtrfsExtentSame::add so it uses a btrfs-compatible
uint64_t instead of an off_t.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESTOOLONG was always reporting a size of zero, and the offset of the
end of the readahead region. Report the original size instead (and also
in BEESTRACE and BEESNOTE).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Drop the crawl_restart counter, it doesn't happen here (or anywhere else).
Add the crawl_again counter for extents that are restarted due to an
extent-level lock.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
libcrucible can deal with the Linux kernel and/or libc's thread name
limitations. No need to duplicate that work in bees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It turns out I've been using pthread_setname_np wrong the whole time:
* on Linux, the thread name length is 15 characters.
TASK_COMM_LEN is 16 bytes, and the last one is always 0.
This is now hardcoded in many places and cannot be changed.
* pthread_setname_np doesn't return -errno, so DIE_IF_MINUS_ERRNO
was the wrong macro. On the other hand, we never want to do anything
differently when pthread_setname_np fails, so we never needed to
check the return value.
Also, libc silently ignores attempts to set the thread name when it is too
long. That's almost certainly a libc bug, but libc probably suppresses
the error result for the same reasons I ignore the error result.
Wrap the pthread_setname function with a C++ std::string overload that
truncates the argument at 15 characters, so we at least get the first
part of the task name in the thread name field. Later commits can deal
with making the bees thread names shorter.
Also wrap pthread_getname for symmetry.
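A sketch of the wrappers (simplified; the crucible versions may differ):

#include <pthread.h>
#include <string>

// Truncate to 15 characters: TASK_COMM_LEN is 16 bytes and the last
// byte is always 0. Ignore the return value, because there is nothing
// useful to do differently when setting the name fails.
static void pthread_setname(const std::string &name)
{
	(void)pthread_setname_np(pthread_self(), name.substr(0, 15).c_str());
}

static std::string pthread_getname()
{
	char buf[16] = { 0 };
	(void)pthread_getname_np(pthread_self(), buf, sizeof(buf));
	return buf;
}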
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The caller of scan_forward has to stop advancing the BeesFileCrawl
position when an extent lock blocks a scan, so that it will resume
from the same position when the Task is scheduled again; otherwise,
bees simply skips over the extent and leaves it incompletely deduped.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restart crawl_more (and update crawl roots and flush FD caches) every
time the transid changes, and only when the transid changes, but
not more often than a reasonable minimum poll interval.
Clean up the log message: use the proper thread name and remove
the wildly inaccurate estimate of when crawl will resume.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We don't need to cache 65536 extent maps, especially if each one
can have almost 700K references.
Valgrind's massif tool points to the extent map cache as a very
large memory allocator, but test runs with memcg disagree.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we have loadavg targeting enabled, there may be no worker threads
available to respond to new subvols, so we should not bother updating
the subvols list.
Put insert_new_crawl into a Task so it only executes when a worker
is available.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On large filesystems where the min_transid of all subvols gets stuck at 0,
bees may lose the ability to effectively track recent data. A secondary sort
by max_transid will allow scanning newer subvols that were created after bees
started running on the filesystem, but before bees completed the first scan
of all subvols.
On the other hand, the secondary sort does a reverse version of the
sequential scan mode, and the sequential scan mode is simply awful.
Disable it for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Also attempted to clarify the descriptions of the modes based on
feedback and questions from users over the years.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split each scan mode into two distinct phases:
1. A heavy discovery phase, where we search the entire filesystem
for something (new items in subvol trees in this case).
2. A light consuming phase, where we fetch extents to dedupe
from places that we found in the discovery phase.
Part 1 recomputes the subvol ordering every time there is a new transid.
For some scan modes this computation is quite expensive, far too costly
to pay for every extent, so we do it no more than once per transaction.
Part 2 is run every time a worker thread hits the crawl_more Task.
It simply pulls one extent from the first crawler off a sorted list,
removing the crawler from the list when the crawler runs out of data.
Part 1 creates a new structure and swaps it into place, while Part 2
continues to run using the previous structure. Neither of these
need to block the other, so they don't.
The separate class and base pointer also make it easier to add new scan
modes that are not based on subvol trees or that don't use BeesCrawl.
While we're here, fix up some method visibility in BeesRoots.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Set the constructor's default scan mode to an invalid mode, so if we
change the default, we don't have to update two places.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Crawl mode 3 'recent' prioritizes data from new updates to previously
scanned subvols over subvols that have not been completely scanned yet.
If no such new data exists, it falls back to a variation of 'lockstep'
scan mode.
This enables us to keep up with new data as it arrives, a key weakness
of all the other scan modes, and worth violating our unwritten "no new
scan modes until we have extent-tree dedupe working" policy for.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Inode-oriented scan workers must do all of their work sequentially,
so it's counterproductive to spawn a Task to do a background dedupe.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When two Tasks attempt to lock the same extent, append the later Task
to the earlier Task's post-exec work queue. This will guarantee that
all Tasks which attempt to manipulate the same extent will execute
sequentially, and free up threads to process other extents.
Similarly, if two scanner threads operate on the same inode, any dedupe
they perform will lock out other scanner threads in btrfs. Avoid this
by serializing Task objects that reference the same file.
This does theoretically use an unbounded amount of memory, but in practice
a Task that encounters a contended extent or inode quickly stops spawning
new Tasks that might increase the queue size, and all Tasks that might
contend for the same lock(s) end up on a single FIFO queue.
Note that the scope of inode locks is intentionally global, i.e. when
an inode is locked, it locks every inode with the same number in every
subvol. This avoids significant lock contention and task queue growth
when the same inode with the same file extents appear in snapshots.
Fixes: https://github.com/Zygo/bees/issues/158
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split crawlers into two separate Tasks:
1. a Task which locates the next inode with a new data extent.
2. a Task which scans every new extent in that inode.
This simplifies some lock contention and execution ordering issues.
Files are read sequentially. Workers dynamically scale up or
down as needed, without creating thousands of deferred Task objects.
Workers obtain inode locks for different inodes in btrfs, so they
can work in parallel instead of waiting for each other.
This change in behavior comes with new names for the worker Tasks:
"crawl_master" is now "crawl_more", the singular Task which
creates inode-scanning Tasks.
"crawl_<subvol>" is now "crawl_<subvol>_<inode>".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This was done on the development branch three years ago, and
has been creating annoying merge conflicts ever since. Sync
up the branches so they have the same names for these.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Drop the cache since we no longer have to open a file every time we
check a subvol's status.
Also stop counting workaround events at the root level twice.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
btrfs-tree provides classes for low-level access to btrfs tree objects.
An item class is provided to decode polymorphic btrfs item fields.
Several tree classes provide forward and backward iteration over raw
object items at different tree levels.
A csum tree class provides convenient access to csums by bytenr,
supporting all current btrfs csum types.
Wrapper classes for inode and subvol items provide direct access to
btrfs metadata fields without clumsy stat() wrappers or ioctls.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This template turns a forward search primitive (e.g. lower_bound, FIEMAP,
TREE_SEARCH_V2) into a backward search primitive.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We are using ByteVectors from multiple threads in some cases. Mostly
these are the status and progress threads which read the ByteVector
object references embedded in BEESNOTE macros.
Since it's not clear what the data race implications are, protect
the shared_ptr in ByteVector with a mutex for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Kernels that needed the balance workaround frankly are too buggy
to run bees at all. The workaround also makes the locking stories
around logical_ino calls and process exit complicated, so get rid of
it completely.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
At some point BtrfsExtentWalker will be fully deprecated and removed from
bees. Might as well start with code that hasn't been built in 6 years.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Quite often bees exceeds its service timeout for termination because
it is waiting for a loop embedded in a Task to finish some long-running
btrfs operation. This can cause bees to be aborted by SIGKILL before
it can completely flush the hash table or save crawl state.
There are only two important things SIGTERM does when bees terminates:
1. Save crawl progress
2. Flush out the hash table
Everything else is automatically handled by the kernel when the process
is terminated by SIGKILL, so we don't have to bother doing it ourselves.
This can save considerable time at shutdown since we don't have to wait
for every thread to reach a point where it becomes idle, or force loops
to terminate by throwing exceptions, or check a condition every time we
access a pointer. Instead, we need only do the things in the list
above, and then call _exit() to clean up everything else.
Hash table and crawl state writeback can happen in their background
threads instead of the foreground one. Separate the "stop" method for
these classes into "stop_request" and "stop_wait" so that these writebacks
can run at the same time.
Deprecate and remove all references to the BeesHalt exception, and remove
several unnecessary checks for BeesContext::stop_requested.
Pause the task queue instead of cancelling it, which preserves the
crawl progress state and stops new Tasks from competing for iops and
CPU during writeback.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Dump the instantaneous load (last 5 seconds, extracted from load average)
and the computed target worker count (before rounding and truncation)
on the same status line as the task and worker thread count.
This should give better visibility into Task's thread count calculation
algorithm.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tasks are often running longer than 5 seconds (especially extents with
multiple references requiring copy operations), so the load tracking
algorithm needs to average several samples over a longer period of time
than 5 seconds. If the sample period is 60 seconds, we end up recomputing
the original load average from current_load, so skip the rounding error
and use the original load average value.
Arguably the real fix is to break up the more complex extent operations
over several downstream Task objects, but that's a more significant
design change.
Tweak the attack and decay rates so that threads are started a little
more slowly, but still stopped rapidly when load spikes up.
Remove the hysteresis to provide support for load average targets
below 1, or with fractional components, with a PWM-like effect.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
task1.append(task2) is supposed to run task2 after task1 is executed;
however, if task1 was just executed, and its last reference was owned by
a TaskConsumer, then task2 will be appended to a Task that will never
run again.
A similar problem arises in Exclusion, which can cause blocked tasks
to occasionally be dropped without executing them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This was resulting in an assertion failure later on if a queue was
being rescued from a deleted task with only one post-exec queue.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
pause(true) stops the TaskMaster from processing any more Tasks,
but does not destroy any queued Tasks.
pause(false) re-enables Task processing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In the event that someday Barrier allows users to force execution of
its pending tasks prior to the destruction of the BarrierState object,
we'll be ready to submit those Tasks for execution without waiting for
the BarrierState mutex lock.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Exclusion was generating a new Task every time a lock was contended.
That results in thousands of empty Task objects which contain a single
Task item.
Get rid of ExclusionState. Exclusion is now a simple weak_ptr to a Task.
If the weak_ptr is expired, the Exclusion is unlocked. If the weak_ptr
is not expired, it points to the Task which owns the Exclusion.
try_lock now appends the Task attempting to lock the Exclusion directly
to the owning Task, eliminating the need for Exclusion to have one.
This also removes the need to call insert_task separately, though
insert_task remains for other use cases.
With no ExclusionState there is no need for a string argument to
Exclusion's constructor, so get rid of that too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make one class Barrier which is copiable, so we don't have to
have users making shared Barrier all the time.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It seems that readahead() does not work on btrfs, or at least it has
no discernible effect. Enable the workaround instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In current kernels there is a bug which leads to an infinite loop in
add_all_parents(). The bug is triggered by one thread running dedupe
while another runs logical_ino.
Work around this by ensuring that bees process never runs dedupe and
logical_ino ioctls at the same time. Any number of either can run
at the same time, but not one of both.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
For performance or workaround reasons we sometimes have to avoid doing
two conflicting operations at the same time, but we can still run any
number of non-conflicting operations in parallel.
MultiLocker (suggestions for a better class name welcome) blocks the
calling thread until there are no threads attempting to run a conflicting
operation.
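A sketch of the semantics (simplified, with invented method names):

#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Any number of threads may hold the lock for one operation type at a
// time; a thread wanting a different type waits until all holders of
// other types are gone.
class MultiLockerSketch {
	std::mutex m_mutex;
	std::condition_variable m_cv;
	std::map<std::string, size_t> m_holders;

	bool only_type_held(const std::string &type) {
		for (const auto &i : m_holders)
			if (i.first != type && i.second > 0) return false;
		return true;
	}
public:
	void lock(const std::string &type) {
		std::unique_lock<std::mutex> lock(m_mutex);
		m_cv.wait(lock, [&] { return only_type_held(type); });
		++m_holders[type];
	}
	void unlock(const std::string &type) {
		std::unique_lock<std::mutex> lock(m_mutex);
		--m_holders[type];
		m_cv.notify_all();
	}
};

With this, any number of threads can hold the "logical_ino" lock at
once, but a thread asking for "dedupe" waits until all of them are done,
and vice versa.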
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees_sync() was an exception-trapping wrapper around fsync() which is
not needed in any of the contexts from which it was called:
1. dedupe operations implicitly flush the src data, so there is
no need to call fsync() to do that twice.
2. crawl position is written to a temporary file and renamed
over the original, which always forces a flush when the original
exists. On the first write, where there is no original, a
crash would result in starting over with an empty or hole-filled
beescrawl file, which is the initial state of bees. There is also
a long history of kernel bugs triggered by fsync() in this case.
3. we use unreadahead to trigger writeback for flushing the
hash table to persistent storage. Here is a space where we might
use fsync after all, as part of bees_unreadahead's emulation of
POSIX_FADV_DONTNEED, but we need to get read-once behavior from
the scanner before we can use this capability.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If there's an error while writing the crawl state, the state should
remain dirty. If the crawl state is successfully written, the state
is only clean if there were no changes to crawl state since the write
was committed. We need to release the lock while writing the state but
correctly set the dirty flag when the state is written successfully.
Replace the bool with a version number counter. Track the last version
successfully saved and the current version of the crawl state. The state
is dirty if these counters disagree and clean if they agree.
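A sketch of the version-counter scheme (names hypothetical):

#include <cstdint>
#include <mutex>

struct StateVersion {
	std::mutex m_mutex;
	uint64_t m_current = 1;
	uint64_t m_saved = 0;

	// any modification bumps the current version
	void modified() {
		std::unique_lock<std::mutex> lock(m_mutex);
		++m_current;
	}

	// dirty whenever the two counters disagree
	bool dirty() {
		std::unique_lock<std::mutex> lock(m_mutex);
		return m_saved != m_current;
	}

	void save() {
		uint64_t version;
		{
			std::unique_lock<std::mutex> lock(m_mutex);
			version = m_current;
		}
		// ...write the state out without holding the lock...
		// (on a write error, return here: the state stays dirty)
		std::unique_lock<std::mutex> lock(m_mutex);
		// if modified() ran during the write, m_current has already
		// moved past 'version' and the state correctly stays dirty
		m_saved = version;
	}
};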
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Document the overall purpose of the class and what some of the methods do,
particularly the ones with terrible names like 'insert_item' (which only
inserts an item after calling the Function).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We already had a function that was _similar_, so add decoding for compress
type NONE, give it a less specific name, and declare it in fs.h.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It really needs to be uint64_t, but at least it now doesn't contradict
the definition in the earlier header.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In commit 14ce81c08 "fs: get rid of silly base class that causes build
failures now" I neglected to set the dest_count field in the ioctl
arg structure, so bees master hasn't been deduping anything for about
three weeks.
I'd put a THROW_CHECK in here to catch this kind of bug in the future,
but it would be placed at exactly the point where this fix is.
Fixes: 14ce81c08
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we iterate over all roots to find the max transid, but the set of
all roots is empty, we'll get a nonsense number. Make sure that number
doesn't reach the crawling logic by killing it with an exception.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Yet another build failure of the form:
error: flexible array member fiemap... not at end of struct crucible::Fiemap...
bees doesn't use fiemap any more, so the fixes here are minimal changes
to make it build, not shining examples of C++ class design.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
They're all public because it's a struct, so there's no need to make
them explicit. clang-14 deprecates these.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The base class thing was an ugly way to get around the lack of C99
compound literals in C++, and also to make the bare ioctls usable with
the derived classes.
Today, both clang and gcc have C99 compound literals, so there's no need
to do crazy things with memset. We never used the derived classes for
ioctls, and for this specific ioctl it would have been a very, very bad
idea, so there's no need to support that either. We do need to jump
through hoops for ostream& operator<<() but we had to do those anyway
as there are other members in the derived type.
So we can simply drop the base class, and build the args object on the
stack in `do_ioctl`. This also removes the need to verify initialization.
There's no bug here since the `info` member of the base class was
never used in place by the derived class, but new compilers reject the
flexible array member in the base class because the derived class makes
`info` be not at the end of the struct any more:
error: flexible array member btrfs_ioctl_same_args::info not at end of struct crucible::BtrfsExtentSame
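A sketch of the resulting pattern: build the args in a local zero-filled
buffer inside do_ioctl, with the flexible array member staying at the
end of the plain C struct (simplified from the actual code):

#include <linux/btrfs.h>
#include <sys/ioctl.h>
#include <cstdint>
#include <vector>

static int do_same_extent(int src_fd, uint64_t src_off, uint64_t len,
			  int dst_fd, uint64_t dst_off)
{
	// one args struct plus one dest info, zero-filled by the vector
	std::vector<uint8_t> buf(sizeof(btrfs_ioctl_same_args) +
				 sizeof(btrfs_ioctl_same_extent_info));
	auto *args = reinterpret_cast<btrfs_ioctl_same_args *>(buf.data());
	args->logical_offset = src_off;
	args->length = len;
	args->dest_count = 1;
	args->info[0].fd = dst_fd;
	args->info[0].logical_offset = dst_off;
	return ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
}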
Fixes: https://github.com/Zygo/bees/issues/232
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Without this, if you install to a different PREFIX such as /usr/local,
the beesd script will fail to recognize any arguments. If you use the
systemd unit, that makes --no-timestamps the first NOT_SUPPORTED_ARG,
which gets passed to uuidparse; uuidparse doesn't recognize it and
errors out.
We had an unfortunate pattern of:
const BeesFileRange bfr;
shared_ptr<BeesContext> ctx;
// ...
BEESNOTE("foo " << bfr);
bfr.fd(ctx);
BEESNOTE("foo after opening: " << bfr);
If dump_status started running after the first BEESNOTE, but before
the second, then bfr.fd() might expose a single Fd object's shared_ptr
member to two threads at the same time (the thread running dump_status
and the thread running BEESNOTE) without protection by a lock. One of
the threads would see a partially-initialized Fd object, and the other
thread would crash on an assertion failure, e.g.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f4c4fde5537 in __GI_abort () at abort.c:79
#2 0x00007f4c4fde540f in __assert_fail_base (fmt=0x7f4c4ff4e128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5557605629dd "!m_destroyed", file=0x5557605627c0 "../include/crucible/namedptr.h", line=77, function=<optimized out>) at assert.c:92
#3 0x00007f4c4fdf4662 in __GI___assert_fail (assertion=assertion@entry=0x5557605629dd "!m_destroyed", file=file@entry=0x5557605627c0 "../include/crucible/namedptr.h", line=line@entry=77,
function=function@entry=0x555760562970 "crucible::NamedPtr<Return, Arguments>::Value::~Value() [with Return = crucible::IOHandle; Arguments = {int}]") at assert.c:101
#4 0x00005557605306f6 in crucible::NamedPtr<crucible::IOHandle, int>::Value::~Value (this=0x7f4a3c2ff0d0, __in_chrg=<optimized out>) at ../include/crucible/namedptr.h:77
#5 0x00005557605137da in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151
#6 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151
#7 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f4c4c5b5f28, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#8 std::__shared_ptr<crucible::IOHandle, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#9 std::shared_ptr<crucible::IOHandle>::~shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
#10 crucible::Fd::~Fd (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at ../include/crucible/fd.h:46
#11 BeesFileRange::file_size (this=0x7f4c4e5ba4a0) at bees-types.cc:156
#12 0x0000555760513950 in operator<< (os=..., bfr=...) at bees-types.cc:80
#13 0x000055576050d662 in std::function<void (std::ostream&)>::operator()(std::ostream&) const (__args#0=..., this=0x7f4c4e5b9f60) at /usr/include/c++/10/bits/std_function.h:622
#14 BeesNote::get_status[abi:cxx11]() () at bees-trace.cc:165
#15 0x00005557604c9676 in BeesContext::dump_status (this=0x5557611c4de0) at bees-context.cc:89
#16 0x00005557605206fb in std::function<void ()>::operator()() const (this=this@entry=0x7f4c4c5b65f0) at /usr/include/c++/10/bits/std_function.h:622
#17 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) (f=..., explainer=...) at error.cc:55
#18 0x000055576050aaa7 in operator() (__closure=0x5557611c52c8) at bees-thread.cc:22
#19 0x00007f4c501beed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#20 0x00007f4c502c8ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007f4c4febddef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Fix by making BeesFileRange::m_fd really const (not just mutable),
then fix all the broken code referencing it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
valgrind doesn't understand ioctl arguments, so it does not know if
or when they initialize memory, and it complains about conditionals
depending on data that comes out of ioctls. That's a problem for bees,
where every decision we ever make is based on data an ioctl gave us.
Fix the initialization issue by using calloc instead of malloc for
ByteVectors when we are building for valgrind. Don't enable this by
default because all the callocs aren't necessary (assuming the rest
of the code is correct) and hurt performance.
Define BEES_VALGRIND in localconf to activate, e.g.
echo CCFLAGS += -DBEES_VALGRIND=1 >> localconf
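A sketch of the conditional allocation (hypothetical helper name):

#include <cstddef>
#include <cstdlib>

// Zero-fill only in valgrind builds; in normal builds the callocs
// aren't necessary (assuming the rest of the code is correct) and
// hurt performance.
static void *byte_vector_alloc(size_t size)
{
#ifdef BEES_VALGRIND
	return calloc(1, size);
#else
	return malloc(size);
#endif
}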
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It turns out we never set m_dirty's initial value. This is not a
practical problem because 1) it's mostly harmless if m_dirty is spuriously
true, 2) we set it to true every time bees scans a data block, and 3)
the allocation happens early in startup when most memory allocations
are using zero-filled pages, so it's probably getting a false value at
construction in most cases.
valgrind complains about it, so it has to go.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Once the physical addresses are known, put them where they can be
seen in BEESTATUS as well as the log.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are kernel bugs in LOGICAL_INO from time to time; however, we
can't avoid these bugs by serializing LOGICAL_INO calls.
It hasn't been used for some time, so remove the code and
less-than-completely-accurate comments.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
For one thing, it should _say_ that there are too many duplicates.
We were making the user read the manual to find that out.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Forcing the entire hash table into immediate writeback causes crippling
write latencies at shutdown. Even discarding pages as they are read in
at startup can trigger a writeback latency spike if the pages are dirty
at read time.
Better to let the VM subsystem handle this on its own.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Putting this information in the logs saves us from having to ask for
the kernel version and machine name every time.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Update thread_local task state pointers while locked. This avoids
potential concurrent access of the pointers while making copies of them.
Verify that the queue is really empty after splicing lists, and the
current consumer is really gone after swapping the empty one.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We need random numbers in more places, so centralize the engines.
Initialize with a proper random seed so every worker thread gets
different behavior.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Fix the locking order for the case where an exception is thrown
in shared_ptr's allocator.
More const.
Drop the explicit closure return type since the compiler can deduce it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Sprinkle in some asserts to make sure compilers aren't getting creative.
This may introduce a new compiler dependency, as I suspect older versions
of GCC don't support this syntax.
It definitely needs a new compiler flag to suppress a warning when some
fields are not explicitly initialized. If we've omitted a field, it's
because it's a field we don't know (or care) about, and we want that
thing initialized to zero.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In commit d9e3c0070b "context: stop creating
new refs when there are too many already" we added a new counter, but didn't
document it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If there is only one Task in the post exec queue, we can
simply insert that Task instead of creating a task to hold
a post exec queue of one item.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
At the end of scanning one extent, in theory we do not need that extent
any more. In practice, it hurts benchmark scores if we drop the extents
after reading them.
Add a comment to note this where we put the bees_unreadahead call.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESNOTE can only be seen if the status thread is running at the time,
making the log of activities during shutdown incomplete.
Wake up the status thread early during shutdown so the logged sequence
of shutdown actions is complete.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In the current architecture we can't directly measure the physical extent
size, and we can't make good decisions with the extent data (reference)
item alone. If the early return is enabled here, there is a small speedup
and a large drop in dedupe hit rate, especially when extent splits occur.
Leave the early return commented for now, but collect the event statistics.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tree searches are all looking for specific item types. Skip over any
item types we are not interested in when resetting the search key for
the next search.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When we are searching the btrfs metadata trees, we usually want only
one type of item. If the last item in a search result is not of the
desired type, we can restart the search at the next possible key with
that item type, potentially skipping over some uninteresting items we
would otherwise have to fetch, process, and discard.
Also remove a bug in the previous next_min code that would skip over
items if the offset overflowed and the next objectid in the tree had a
lower item type number than the previous objectid. This doesn't seem
to be a bug that has ever happened, as it would require a file to roll
over in the offset field.
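Roughly, the key-advance logic (a sketch, not the exact crucible code):

#include <cstdint>

struct BtrfsKey { uint64_t objectid; uint8_t type; uint64_t offset; };

// Compute the next possible key of the wanted item type, strictly
// after the last key returned by the previous search.
static BtrfsKey next_min(const BtrfsKey &last, uint8_t wanted)
{
	if (last.type < wanted) {
		// skip forward to the first possible key of the wanted
		// type within the same objectid
		return { last.objectid, wanted, 0 };
	}
	if (last.type == wanted && last.offset < UINT64_MAX) {
		// usual case: continue with the next offset
		return { last.objectid, wanted, last.offset + 1 };
	}
	// wanted type already passed, or the offset rolled over:
	// restart at the next objectid
	return { last.objectid + 1, wanted, 0 };
}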
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BtrfsIoctlSearchKeyV2's constructor now fills in nr_items = 1, so we
don't need to set it explicitly any more.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
vector_copy_struct constructed a std::vector<uint8_t> from a fixed-size
struct. ByteVector replaces std::vector<uint8_t> and has a template
constructor which does the same thing as vector_copy_struct, so there
is no longer a need for this function.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Spanner was a workaround for terrible std::vector _copy_ performance,
but it turns out that std::vector has terrible _allocator_ performance
(compared to an implementation based on malloc and memcpy). Spanner is a
workaround for the copy performance issue, so it doesn't help very much.
Refraining from using vector at all is much better.
Now that all code that used Spanner has been converted to ByteVector,
there's no further need for Spanner<uint8_t>, which was the only type
it was ever used for.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We can simply remove the template specializations, but if we do that, then
existing code might accidentally write out the vector<uint8_t> struct.
Prevent regressions by deleting the vector specializations, making any
code that uses them fail to build.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The vector<uint8_t> in the hash table doesn't hurt very much--only a few
microseconds per 128K hash block.
The vector<uint8_t> in BeesBlockData hurts a bit more--we run that
constructor thousands of times per second.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add support for pread and pwrite of ByteVector objects alongside
vector<uint8_t>. A later commit will delete the template specializations
for vector<uint8_t>, but existing users have to be updated to use
ByteVector first.
Nothing currently uses vector<char>, so we can delete that immediately.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Switch various methods in fs to use ByteVector to cut down on the number
of slow allocations and copies.
Automatically determine the correct size for TREE_SEARCH_V2 buffers
based on the number of items requested, and grow the buffer as needed.
This eliminates the need to cache some objects that were heavy to create.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Now that we can guess the size more or less automatically, there's
no need to make it unnecessarily large.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
After some benchmarking, it turns out that std::vector<uint8_t> is
about 160 times slower than malloc(). malloc() is faster than "new
uint8_t[]" too. Get rid of std::vector<uint8_t> and replace it with
a lightweight wrapper around malloc(), free(), and memcpy().
ByteVector has helpful methods for the common case of moving data to and
from ioctl calls that use a fixed-length header placed contiguously with a
variable-length input/output buffer. Data bytes are shared between copied
ByteVector objects, allowing a large single buffer to be cheaply chopped
up into smaller objects without memory copies. ByteVector implements the
more useful parts of the std::vector API, so it can replace std::vector
objects without needing an awkward adaptor class like Spanner.
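A sketch of the core idea (the real ByteVector implements more of the
std::vector API than shown here):

#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>
#include <stdexcept>

// A shared malloc'd buffer plus an (offset, size) window, so "copies"
// and subranges share one allocation instead of copying bytes.
class ByteVectorSketch {
	std::shared_ptr<uint8_t> m_ptr;  // owns the malloc'd buffer
	size_t m_size = 0;
public:
	explicit ByteVectorSketch(size_t size)
		: m_ptr(static_cast<uint8_t *>(malloc(size)), free),
		  m_size(size)
	{
		if (size && !m_ptr) throw std::bad_alloc();
	}
	// chop out a subrange: shares the underlying buffer, no memcpy
	ByteVectorSketch at(size_t offset, size_t size) const {
		if (offset + size > m_size)
			throw std::out_of_range("ByteVectorSketch");
		ByteVectorSketch ret(*this);
		// aliasing constructor: same owner, shifted pointer
		ret.m_ptr = std::shared_ptr<uint8_t>(m_ptr, m_ptr.get() + offset);
		ret.m_size = size;
		return ret;
	}
	uint8_t *data() const { return m_ptr.get(); }
	size_t size() const { return m_size; }
};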
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Sometimes we need to check constraints on 4 variables at once.
It would be nice if variadic macros in C++ were also polymorphic.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Previously, when the bees send workaround is enabled, bees would
immediately advance the subvol's crawl status as if the entire subvol
had been scanned.
If the subvol is later made read-write, or if the workaround is disabled,
bees sees that the subvol has already been marked as scanned. This is
an unfortunate result if the subvol is inadvertently marked read-only
or if bees is inadvertently run with the send workaround disabled.
Instead, (almost) completely ignore the subvol: don't advance the crawl
pointer, don't consider the subvol in the list of searchable roots, and
don't consider the subvol when calculating min_transid for new subvols.
The "almost" part is: if the subvol scan has not yet started, keep its
start timestamp current so it won't mess up subvol traversal performance
metrics.
Also handle exceptions while determining whether a subvol is read-only,
as those apparently do happen.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
install -Dm644 scripts/beesd.conf.sample $(DESTDIR)/$(ETC_PREFIX)/bees/beesd.conf.sample
will expand to //etc/bees/beesd.conf.sample. This patch removes the duplicated slash.
Since we started locking down the beesd service, we no longer have
privileges to do some things. Have systemd do it for us instead.
Fixes: #195
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In fiemap.h the members of struct fiemap are declared as __u64, but the
FIEMAP_MAX_OFFSET macro is an unsigned long long value:
$ grep FIEMAP_MAX_OFFSET -r /usr/include/
/usr/include/linux/fiemap.h:#define FIEMAP_MAX_OFFSET (~0ULL)
$ grep fe_length -r /usr/include/
/usr/include/linux/fiemap.h: __u64 fe_length; /* length in bytes for this extent */
This results in a type mismatch error on architectures like ppc64le:
fiemap.cc:31:35: note: deduced conflicting types for parameter 'const _Tp' ('long unsigned int' and 'long long unsigned int')
31 | fm.fm_length = min(fm.fm_length, FIEMAP_MAX_OFFSET - fm.fm_start);
| ~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Work around this by copying the macro into a uint64_t constant,
and not using the macro any more.
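A sketch of the workaround (the explicit template argument is one way
to keep deduction out of it entirely):

#include <algorithm>
#include <cstdint>
#include <linux/fiemap.h>

// Copy the macro into a typed constant once, and use that instead of
// the macro, so min() sees one type on every arch.
static const uint64_t s_fiemap_max_offset = FIEMAP_MAX_OFFSET;

static void clamp_length(fiemap &fm)
{
	fm.fm_length = std::min<uint64_t>(fm.fm_length,
					  s_fiemap_max_offset - fm.fm_start);
}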
Fixes: https://github.com/Zygo/bees/issues/194
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table is one of the few cases in bees where a non-trivial amount
of page cache memory will be used in a predictable way, so we can advise
the kernel about our IO demands in advance.
Use WILLNEED to prefetch hash table pages at startup.
Use DONTNEED to trigger writeback on hash table pages at shutdown.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In theory, we don't need the pread() loop, because the kernel will do a
better job with readahead().
In practice, we might still need the pread() code, as the readahead will
occur at idle IO priority, which could adversely affect bees performance.
More testing is required.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The assignment operator will use member-wise assignment, which
assumes the object's this pointer is aligned. That doesn't
happen when the object in question is part of a btrfs search
result, and aarch64 faults over it.
Use memcpy instead, which has no alignment constraints.
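A sketch of the pattern:

#include <cstring>

// Items in a btrfs search result buffer are packed, so a struct may
// start at any byte offset. memcpy has no alignment constraints, while
// member-wise assignment assumes an aligned this pointer.
// T must be trivially copyable.
template <class T>
static T get_unaligned(const void *p)
{
	T ret;
	memcpy(&ret, p, sizeof(T));
	return ret;
}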
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Btrfs mount options affect all mount points that use the same btrfs
partition, so specifying them per-mount is useless.
Also, common mount options like `noatime,nosuid,nodev,noexec` have little
to no effect on beesd, so it's better and simpler to remove this.
Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
I've verified that with this setup, users can access the log in
/run/bees, but cannot access the mounted filesystem.
Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
Like filefrag, fiemap was defaulting to FIEMAP_FLAG_SYNC, and providing no
option to turn it off. This prevents observation of delayed allocations,
making fiemap less useful.
Override the default flag setting so fiemap gets the current
(i.e. unflushed) extent map state.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
LOGICAL_INO_V2 has a maximum limit of 699050 references per extent.
Although it no longer has a crippling performance problem, at roughly
two seconds to process an extent, it's too slow to be useful.
When an extent gains an absurd number of references, stop making any
more. Returning zero extent refs will make bees believe the extent
was deleted, and it will remove the block from the hash table.
This helps speed processing of highly duplicated large files like
VM images, at the cost of a slightly lower dedupe hit rate.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The default name of a newly constructed thread is apparently the name
of the thread that created it. That's very misleading when there are
a lot of TaskConsumer threads and they have nothing to do, so set the
name of each TaskConsumer thread as soon as it is created.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In 15ab981d9e "bees: replace uncaught_exception(), deprecated in C++17",
uncaught_exception() was replaced with current_exception(); however,
current_exception() is only valid after an exception has been captured
by a catch block.
BeesTracer wants to know about exceptions _before_ they are caught,
so current_exception() is not useful here.
Instead, conditionally compile using uncaught_exception() or
uncaught_exceptions(), selected by C++ standard version, and make
bees stack traces work again.
Fixes: 15ab981d9e "bees: replace uncaught_exception(), deprecated in C++17"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This allows these components to be used by test executables without
pulling in all of bees, and more rapidly iterate their code.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add some conditionally-compiled debug code, including an in-memory log
of what ExtentWalker does. Dump that log on exceptions.
If we loop too many times in a debug build, kill the process so we can
stack trace. In non-debug builds just throw a normal exception.
Grow the step size instead of shrinking it, to reduce the number of
binary search iterations.
Prevent a bug where the step size bottoms out before positioning the
target extent in the middle of the result vector.
Use the first extent for "first_extent", instead of the 3rd.
Get rid of some redundant checks.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When a file ends with a hole, ExtentWalker synthesizes a hole extent record
to cover the distance between the last ipos and EOF. Unfortunately, ipos
was incremented by the number of items in the result vector instead. Fix
that by incrementing by hole_extent.size().
While we're here, fix up some of the other data quality logic, including
a useless THROW_CHECK that was nothing but workarounds for earlier bugs.
Fixes: https://github.com/Zygo/bees/issues/26
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Two new tree mod log bugs #5 and #6 (uncovered by the zoned IO work,
though #6 has been seen in the wild on 5.10.29).
Tweak the text of some of the workarounds.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some users are hitting an exception somewhere in crawl_transid, which
forces bees to return back to the transid_max calculation over and over.
Also out-of-range transids.
Add some BEESTRACE so we can see what we were doing in the exception
handler.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Currently if crawl throws an exception, we don't have basic information
about what was being crawled or even if the crawler was running at all.
These traces also help identify the causes of early exception failures.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This might be interesting information, though most of the motivation for
this evaporated when kernel 5.7 came out.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There seem to be multiple ways to do readahead in Linux, and only some
of them work. Hopefully reading the actual data is one of them.
This is an attempt to avoid page-by-page reads in the generic dedupe code.
We load both extents into the VFS cache (read sequentially) and hope they
are still there by the time we call dedupe on them.
We also call readahead(2) and hopefully that either helps or does nothing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This enables us to correlate FD cache clears with external events such
as btrfs inode eviction storms.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Report the number of Task objects that currently exist as well as the number
on the global work queue.
THREADS (work queue 298 of 2385 tasks, 16 workers):
This helps spot leaks, since Task objects that are blocked on other Task
post-exec queues are otherwise invisible.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Testing sometimes crashes during exec of the first Task object, which
triggers construction of TaskConsumer threads. Manage the life cycle
of the thread more strictly--don't access any methods of TaskConsumer
or std::thread until the constructor's caller's lock on TaskMaster
is released.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task::run() would schedule a new execution of Task, unless it was waiting
on a queue for execution. This cannot be implemented with a bool,
since a Task might be included in multiple queues, and should still be
in waiting state even when executed in that case.
Replace the bool with a counter. run() and append() (but not
append_nolock) increment the counter, exec() decrements the counter.
If the counter is non-zero when run() or append() is called, the Task
is not scheduled.
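A sketch of the counter (names hypothetical):

#include <mutex>

// A Task may sit on several queues at once, so "is waiting" must be a
// counter, not a bool. run() and append() increment it; exec()
// decrements it. The Task is scheduled only when the count was zero.
struct TaskRunCountSketch {
	std::mutex m_mutex;
	unsigned m_run_count = 0;

	// returns true if the caller should schedule the task now
	bool run() {
		std::unique_lock<std::mutex> lock(m_mutex);
		return m_run_count++ == 0;
	}
	void exec_done() {
		std::unique_lock<std::mutex> lock(m_mutex);
		if (m_run_count) --m_run_count;
	}
};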
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This is a simple lightweight counter that tracks the number of Task
objects that exist. Useful for leak detection.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Quite often we want to execute task B after task A finishes executing,
especially if tasks A and B attempt to acquire locks on the same objects.
Implement that capability in Task directly: each Task holds a queue
of Tasks which will be executed strictly after this Task has finished
executing, or if the Task is destroyed.
Add a local queue to each TaskConsumer. This queue contains a list
of Tasks which are to be executed by a single thread in sequential
order. These tasks are executed before fetching any tasks from
TaskMaster.
Each time a Task finishes executing, the list of tasks appended to the
recently executed Task are spliced at the beginning of the thread's
TaskConsumer local queue. These tasks will be executed in the same
thread in the same order they were appended to the recently executed Task.
If a Task is destroyed with a post-execution queue, that queue is
also inserted at the front of the current TaskConsumer's local queue.
If a Task is destroyed or somehow executed outside of a TaskConsumer
thread, or a TaskConsumer thread is destroyed, the local queue of Tasks
is wrapped in a "rescue_task" Task, and spliced before the head of the
global queue. This preserves the sequential ordering of tasks.
In all cases the order of sequential execution of Tasks that are
appended to another Task is preserved.
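The core of the splice, sketched with illustrative names (std::list's
splice() moves nodes without copying, so ordering is preserved):

// when "finished" completes, its post-exec queue runs next on this thread
m_local_queue.splice(m_local_queue.begin(), finished.m_post_exec_queue);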
The unused queue insertion functions are removed.
Exclusion is now simply a mutex, a bool, and a Task with an empty
function. Tasks that queue up waiting for the mutex are stored in
Exclusion's Task, and Exclusion simply runs that task when the
ExclusionState is released.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Change documentation and comments to use the word "dedupe," not "dedup"
as found in circa-3.15 kernel sources.
No changes in code or program output--identifiers that were spelled
"dedup" before are still spelled "dedup" now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Fd's cache does not handle changes in the state of its IOHandle parameter.
If we allow:
Fd f;
f->close();
then Fd ends up caching a pointer to a closed Fd, and will become very
badly confused if a new Fd appears with the same int identifier.
Fix by removing the close method.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Drop the ListType alias because we only use it once. Rename ListRep to
PoolRep to better reflect what it does.
We don't need the Pool to be available to handle destroyed Pool::Handle
objects. A weak_ptr in the Handle would detect the Pool has been
destroyed, so we don't need to track that ourselves. As a bonus, we can
destroy the PoolRep object as soon as the Pool has been destroyed, delayed
only if there is a Handle object currently executing its destructor.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Higher CPU core counts became more common, and kernel bugs became less
common, since the arbitrary 8-thread limit was introduced. We can remove
the limit now, and treat any remaining scaling inefficiency as a bug to
be removed.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The dependency was missing, so changes to the library would not trigger
a rebuild of the bees binary.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Support for multiple BeesContext objects sharing a FdCache was wasting
significant space and atomic inc/dec memory cycles for no good reason
since the shared-FdCache feature was deprecated.
open_root and open_root_ino still need a BeesContext to work. Pass the
BeesContext pointer through the function object instead of the cache
key arguments.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
pthread_cancel doesn't really work properly. It was only being used in
bees to bring threads to a stop if the BeesContext is destroyed early.
It is frequently implicated in core dump reports because of the fragility
of the C++ iostream / C stdio / library infrastructure, particularly
surrounding upgrades on the host running bees. The pthread_cancel call
itself often simply fails even when it doesn't call terminate().
Defer creation of the status and progress threads until after the
BeesContext::start method is invoked. At that point, the existing
ask-threads-nicely-to-stop code is up and running, and normal condvars
can be used to bring bees to a stop, without having to resort to
pthread_cancel.
Since we're deleting half of the BeesContext constructor in this change,
let's remove the other half too, and put an end to the deprecated support
for multiple BeesContexts sharing a process. It's still possible to run
multiple BeesContexts, but they will not share a FD cache. This will
allow the FD cache's keys to become smaller and hopefully save some
memory later on.
Fixes: #171
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
C99's "{ 0 }" notation for filling in a struct with all zeros was not
included in the C++11 standard, so gcc doesn't implement it and neither
does clang.
gcc does (did?) have issues with warnings on the same code in C99,
complaining about uninitialized struct members when "{0}" explicitly
initializes every member to a zero value. These issues don't apply in
the C++ code where NTOA_TABLE_ENTRY_END is used.
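For reference, the portable C++11 spelling is empty-brace
value-initialization (the struct here is only an example):

struct example { int a; char b[8]; };
example e = {};  // zeroes every member, same intent as C99's "= { 0 }"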
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Get rid of an assert in bits_ntoa. Throw an exception instead.
Fix hex formatting (adding "0x" before a decimal number is not
the correct way to format hex strings).
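A sketch of the corrected formatting (illustrative):

#include <sstream>

std::ostringstream oss;
oss << "0x" << std::hex << value;  // digits are actually printed in hex
// versus the bug: "0x" glued in front of decimal digits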
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The kernel from such an old distro version likely has several unfixed
bugs. Better not to support it at all.
Users who can upgrade the kernel are probably also sophisticated enough
to fix the build issues too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The weird things distros do to the path where uuid.h gets installed
have broken bees builds for the last time.
We were only using uuid to support a legacy feature that was removed
over four years ago.
Hypothetical users who are upgrading directly from bees v0.1 should
probably restart all the crawlers anyway--there were bugs. Also, if any
such users exist, I respect their tremendous patience with the horrible
performance all these years--bees got about 30x faster since v0.1.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The slow backrefs performance improvement is confirmed by reports from
multiple users:
* Me (5.4.60 + backref patches, 5.7 to 5.11)
* https://github.com/Zygo/bees/issues/161 (5.8)
* https://github.com/Zygo/bees/issues/162 (5.8)
* IRC user S0rin (5.4.88 + backref patches)
The issue still exists, but at a significantly reduced scale: now about
2 ms of CPU per ref on a fast machine.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The Linux kernel's btrfs headers are better than the libbtrfs-dev headers:
- the libbtrfs-dev headers have C++ language compatibility issues
- upstream version in Linux kernel is more accurate and up to date
- macros in libbtrfs-dev's ctree.h hide information that would
enable bees to perform runtime buffer length checking
- enum types whose presence cannot be detected with #ifdef
When accessing members of metadata items from the filesystem, we want
to verify that the member we are accessing is within the boundaries of
the item that was retrieved; otherwise, a memory access violation may
occur or garbage may be returned to the caller. A simple C++ template,
given a pointer to a structure member and a buffer, can determine that
the buffer contains enough bytes to safely access a struct member.
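A minimal sketch of such a template, assuming plain-old-data item
structures (names are illustrative, not the actual helpers):

#include <cstdint>
#include <cstring>
#include <stdexcept>

template <class S, class M>
M get_member(const uint8_t *buf, size_t buf_len, M S::*member)
{
    static const S s_proto {};
    // byte offset of the member within S
    const size_t offset =
        reinterpret_cast<const uint8_t *>(&(s_proto.*member)) -
        reinterpret_cast<const uint8_t *>(&s_proto);
    if (offset + sizeof(M) > buf_len)
        throw std::out_of_range("member extends past end of item");
    M rv;
    std::memcpy(&rv, buf + offset, sizeof(M));  // memcpy tolerates packing
    return rv;
}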
This was implemented back in 2016, but left unused due to ctree.h issues.
Some btrfs metadata structures have variable length despite using a
fixed-size in-memory structure. The members that appear earliest in
the structure contain information about which following members of the
structure are used. The item stored in the filesystem is truncated after
the last used member, and all following members must not be accessed.
'btrfs_stack_*' accessor macros obscure the memory boundaries of the
members they access, which makes it impossible for a C++ template to
verify the memory access. If the template checks the length of the
entire structure, it will find an access violation for variable-length
metadata items because the item is rarely large enough for the entire
structure.
Get rid of all the libbtrfs-dev accessor macros and reimplement them
with the necessary buffer length checks.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apparently it is missing in newer Linux headers, making
builds fail. We don't need it, so remove it.
Closes: #160
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make these workarounds configurable in src/bees.h instead of #if 0
code blocks. Someday we'll make the constants in bees.h configurable
through a file or similar.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Spanner<Iterator> turns a pair of pointers into a sequence container
with several of vector's methods.
A partial specialization of make_spanner is provided which uses
shared_ptr as the beginning of the range. Some of the Spanner code
is a questionable hack in support of this.
C++20 has ranges and span, but neither is worth moving the minimum
C++ standard forward.
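The shape of the template, sketched (illustrative, omitting the
shared_ptr specialization and most of vector's methods):

#include <cstddef>
#include <iterator>

template <class Iterator>
class Spanner {
public:
    using reference = typename std::iterator_traits<Iterator>::reference;
    Spanner(Iterator first, Iterator last) : m_first(first), m_last(last) {}
    Iterator begin() const { return m_first; }
    Iterator end() const { return m_last; }
    size_t size() const { return m_last - m_first; }  // random access only
    bool empty() const { return m_first == m_last; }
    reference operator[](size_t i) const { return m_first[i]; }
private:
    Iterator m_first, m_last;
};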
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When we are using non-copying containers, we can't call resize() on them.
get_struct_ptr is essentially a pointer cast, so we will end up with a
pointer to a struct that extends beyond the boundaries of the container.
As long as the btrfs metadata is not corrupted, we should not have too
many problems.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Use uint8_t when we mean uint8_t, i.e. vector<uint8_t> instead of
vector<char>.
Add a template parameter instead of vector so we can swap in a
non-copying data type.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Define a local copy of the header that has fields for the csum type
and length, so we can build in places that haven't caught up to kernel
5.5 headers yet.
The reason why the csum type and length are not unconditionally filled
in eludes me. csum_length is necessarily non-zero, and the cost of
the conditional is worse than the cost of the copy, so the whole flags
dance is a WTF...but it's part of the kernel API now, so it's too late
to NAK it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Rewrite Fd using a much simpler named resource template class with
a more straightforward derivation strategy.
Behavior change: we no longer throw an exception while calling get_fd()
on a closed Fd. This does not seem to bother any current callers except
for the tests.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
fiewalk and fiemap depend on a lot of crucible, and incremental builds
fail hard without proper dependency tracking.
All binaries must be rebuilt when makeflags changes. This dependency
exists already in lib and test, but src was missing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
NamedPtr provides reference-counted handles to named objects. The object
is created the first time the associated name is used, and stored under
the associated name until the last handle is destroyed. NamedPtr may
itself be destroyed while handles are still active.
This template is intended to replace ResourceHandle with a more general
and less invasive implementation.
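A minimal sketch of the idea (illustrative and simplified--map entries
are never erased here, while the real implementation cleans up after
the last handle):

#include <map>
#include <memory>
#include <mutex>

template <class T, class Name>
class NamedPtr {
public:
    template <class F>
    std::shared_ptr<T> operator()(const Name &name, F make)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        auto &weak = m_map[name];
        if (auto strong = weak.lock())
            return strong;                  // object already exists
        std::shared_ptr<T> strong(make());  // first use: create it
        weak = strong;                      // remember it under the name
        return strong;
    }
private:
    std::mutex m_mutex;
    std::map<Name, std::weak_ptr<T>> m_map;
};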
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Use a single static variable located in the library, instead of
having a separate one for each compilation unit.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
std::list and std::map both have stable iterators, and list has the
splice() method, so we don't need a hand-rolled double-linked list here.
Coalesce insert() and operator() into a single function.
Drop the unused prune() method.
Move destructor calls for cached objects out from under the cache lock.
Closing a lot of files at once is already expensive, might as well not
stop the world while we do it.
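The property that makes this work, sketched (illustrative):

#include <list>
#include <map>

void touch(std::list<int> &lru,
           std::map<int, std::list<int>::iterator> &index, int key)
{
    // promote the entry to the front of the LRU list; list iterators
    // stored in the map remain valid across the splice
    lru.splice(lru.begin(), lru, index.at(key));
}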
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Due to a missing dependency, tests are not rebuilt when the library
changes, so tests return false results after library source changes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we create an identical .version.cc then don't bother keeping it.
This prevents libcrucible from rebuilding if there are no other changes,
which in turn prevents all the binaries from rebuilding unconditionally.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some versions of linux-libc header files define a macro named 'crc32c'.
We want to use that name too, so #undef it.
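i.e. (a sketch):

#ifdef crc32c
#undef crc32c  // some linux-libc headers define this as a macro
#endif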
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Now that tempfiles are using pool checkin functions to control their
size, we don't need a size limit in realign().
We keep the limit in make_copy because it's a sanity check against
letting a multi-terabyte copy operation slip through.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Get rid of the thread-local TempFiles and use Pool instead. This
eliminates a potential FD leak when the loadavg governor repeatedly
creates and destroys threads.
With the old per-thread TempFiles, we were guaranteed to have exclusive
ownership of the TempFile object within the current thread. Pool is
somewhat stricter: it only guarantees ownership while the checked-out
Handle exists. Adjust the users of TempFile objects to ensure they hold
the Handle object until they are finished using the TempFile.
It appears that maintaining large, heavily-reflinked, long-lived temporary
files costs more than truncating after every use: btrfs has to write
multiple references to the temporary file's extents, then some commits
later, remove references as the temporary file is deleted or truncated.
Using the temporary file in a dedupe operation flushes the data to disk,
so nothing is saved by pretending that there is writeback pipelining and
trying to avoid flushes in truncate. Pool provides usage tracking and
a checkin callback, so use it to truncate the temporary file immediately
after every use.
Redesign TempFile so that every instance creates exactly one Fd which
persists over the lifetime of the TempFile object. Provide a reset()
method which resets the file back to the initial state and call it from
the Pool checkin callback. This makes TempFile's lifetime equivalent to
its Fd's lifetime, which simplifies interactions with FdCache and Roots.
This change means we can now blacklist temporary files without having
an effective memory leak, so do that. We also have a reason to ever
remove something from the blacklist, so add a method for that too.
In order to move to extent-centric addressing, we need to be able to
reliably open temporary files by root and inode number. Previously we
would place TempFile fd's into the cache with insert_root_ino, but the
cache would be cleared periodically, and it would not be possible to
reopen temporary files after that happened. Now that the TempFile's
lifetime is the same as the TempFile Fd's lifetime, we can have TempFile
manage a separate FileId -> Fd map in Roots which is unaffected by the
periodic cache clearing. BeesRoots::open_root_ino_nocache will check
this map before attempting to open the file via btrfs root+ino lookup,
and return it through the cache as if Roots had opened the file via btrfs.
Hold a reference to BeesRoots in BeesTempFile because the usual way
to get such a reference now throws an exception in BeesTempFile's
destructor.
These changes make method BeesTempFile::create() and all methods named
insert_root_ino unnecessary, so delete them.
We construct and destroy TempFiles much less often now, so make their
constructor and destructor more informative.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Pool is a place to store shared_ptrs to generated objects (T) that are
too expensive to create and destroy between individual uses, such as
temporary files. Objects in a Pool have no distinct identity
(contrast with Cache or NamedPtr).
Users of the Pool invoke the Pool function call overload and "check out"
a shared_ptr<T> for a T object from the Pool. When the last referencing
shared_ptr<T> is destroyed, the T object is "checked in" to the Pool.
Each call of the Pool function overload checks out a shared_ptr<T> to a T
object that is not currently referenced by any other public shared_ptr<T>.
If there are no existing T objects in the Pool, a new T is constructed
by calling the generator function.
The clear() method destroys all checked in T objects owned by the Pool
at the time the method is called. T objects that are checked out are
not affected by clear(), and they will be stored in the Pool when they
are checked in.
If the checkout function is provided, it is called on a shared_ptr<T>
during checkout, before returning to the caller.
If the checkin function is provided, it is called on a shared_ptr<T>
before returning it to the Pool. The checkin function must not throw
exceptions.
The Pool may be destroyed while T objects are checked out of the Pool.
In that case, when the T objects are checked in, the T object is
immediately destroyed without calling the checkin function.
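A minimal sketch of that behavior, using a shared_ptr custom deleter as
the checkin trigger (illustrative, not the actual crucible code; the
checkout callback is omitted):

#include <functional>
#include <list>
#include <memory>
#include <mutex>

template <class T>
class Pool {
    struct Rep {
        std::mutex mutex;
        std::list<std::shared_ptr<T>> idle;  // checked-in objects
        std::function<std::shared_ptr<T>()> generate;
        std::function<void(const std::shared_ptr<T> &)> checkin;
    };
    std::shared_ptr<Rep> m_rep = std::make_shared<Rep>();
public:
    Pool(std::function<std::shared_ptr<T>()> generate,
         std::function<void(const std::shared_ptr<T> &)> checkin = {})
    {
        m_rep->generate = generate;
        m_rep->checkin = checkin;
    }
    std::shared_ptr<T> operator()()  // check out a T
    {
        std::shared_ptr<T> obj;
        {
            std::unique_lock<std::mutex> lock(m_rep->mutex);
            if (!m_rep->idle.empty()) {
                obj = m_rep->idle.front();
                m_rep->idle.pop_front();
            }
        }
        if (!obj)
            obj = m_rep->generate();  // pool was empty: make a new T
        std::weak_ptr<Rep> weak_rep(m_rep);
        // destroying the last copy of this handle checks the T back in
        return std::shared_ptr<T>(obj.get(), [obj, weak_rep](T *) mutable {
            if (auto rep = weak_rep.lock()) {
                if (rep->checkin)
                    rep->checkin(obj);  // must not throw
                std::unique_lock<std::mutex> lock(rep->mutex);
                rep->idle.push_back(obj);
            }
            obj.reset();  // if the Pool is gone, this destroys T now
        });
    }
};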
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A prealloc extent reference can be deduped immediately and asynchronously.
There is no need to slow down extent scanning to do it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
I was never able to prove a connection between fsync() and deadlock bugs.
There were too many deadlock bugs to be able to isolate a bug that is
triggered specifically by fsync.
Update the comment (which has been unchanged since kernel 4.14). We still
may want to do fsync() on temporary files someday, but there's a full
internal API rewrite between here and there.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Perf blames this operator for >1% of instructions with -O2, and
70% of instructions without -O2.
Let the compiler inline the function.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The requested size may not match the final size of the container,
so consistently use the container's size after prepare(), not the
requested size.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Number of items should be low enough that we don't have too many stale
items, but high enough to amortize system call overhead to a reasonable
ratio.
Number of bytes should be constant: one worst-case metadata page (the
btrfs limit is 64K, though 16K is much more common) so that we always
have enough space for one worst-case item; otherwise, we get EOVERFLOW
if we set the number of items too low and there's a big item in the tree,
and we can't make further progress.
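A sketch of the resulting setup (the ioctl and struct are the real
kernel API; the values follow the reasoning above):

#include <linux/btrfs.h>
#include <cstdint>

union {
    struct btrfs_ioctl_search_args_v2 args;
    uint8_t bytes[sizeof(struct btrfs_ioctl_search_args_v2) + 65536];
} u = {};
u.args.buf_size = 65536;   // one worst-case btrfs metadata page
u.args.key.nr_items = 16;  // few enough to keep stale items cheap
// ...fill in the rest of u.args.key, then:
// ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, &u.args);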
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are lots of ways the search can fail, but it's hard to pick one
without knowing the parameters.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It's a pain to read, edit, and format large blocks of text in C++ code,
so rip the usage message out of bees.cc and put it in a plain text file.
Use a minimal translator to convert it into a C string.
While we're here, remove the multiple roots feature from the command
line synopsis, as we don't really support it any more. Also clarify
that "id 5" is "subvol id 5", and describe in one sentence what
workaround-btrfs-send does.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Silence the unused variable warning. The compiler is correct, but we
may implement line-level debug at some point in the future, so we
want to keep the member and parameters.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Remove unused function getenv_or_die. All of our environment variable
parameters are optional or have default values.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Get rid of unused template instantiation.
Drop the unused realtime signals from the ntoa table. If in the future
we really need to solve clang's issue with them, we'll address it then.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A long time ago, when bees used dedicated threads to scan each subvol,
the calculation of the "dedup_unique_bytes" statistic was already wrong.
This stat can only be calculated when dedupe runs on extent data items
instead of extent reference items. Remove the stat variable until
that happens.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There was a 4th tree mod log crash that showed up in testing. It can
be reproduced or eliminated by applying or reverting d2311e698578
("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
to a 5.4.x kernel before 5.4.54.
Unfortunately, the test can only run if several other patches that
fixed other bugs in d2311e698578 are applied or removed at the same time.
Commit d2311e698578 introduces a bug which destroys filesystems under test
long before tree mod log failures can be reproduced in testing. One of
those patches also fixes tree mod log issue #4. I do not know which one,
but since kernels after 5.1 cannot run without all of those patches, I do
not think it matters.
Tree mod issue #4 is the reason why the tree mod workaround is still
required on all kernels before 5.4. The issue still exists on older
LTS kernels, e.g. 4.9.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Rewrite the text related to 'btrfs send' to clarify that the send
workaround is no longer necessary to avoid kernel crashes, but still
useful because send and dedupe still do not work at the same time.
Replace "many backref code changes" with a specific commit reference,
and improve the grammar of some issue descriptions.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apparently there's Github Flavored Markdown, and there's the markup
language that github uses, and they are distinct things.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Present known kernel bugs in table form with issue descriptions,
fixed and broken kernel versions, and references to fixes.
Update kernel version recommendations to include information on kernel
versions up to 5.8.14.
Reduce emphasis on data corruption bugs which are 1) two or more
years old now, and 2) much less bad than the bugs in kernel 5.1.
Add deprecation warning for kernels before 4.15.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Prefer to use cmark-gfm with extension 'table' so we can use tables in
locally-generated HTML files. If cmark-gfm is not installed then
fall back to some other Markdown implementation, but the tables will
be broken on every other implementation I have tried so far.
Also make the HTML output depend on the Makefile, since there may be
document translation options specified there (like '-e table' or an
entirely different Markdown implementation).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
uncaught_exception() had only the one valid use case, and it can be
reimplemented by literally calling current_exception() instead.
current_exception() has several valid use cases, so it is not likely
to be deprecated any time soon.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We cannot use BeesContext::roots() until after
BeesContext::set_root_path() has been called.
Save up the parameter settings until then.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"Storm of softlockups" starts with a simple BUG_ON, but after the
BUG_ON, all cores that are waiting on spinlocks get stuck.
The _first_ kernel call trace is required to identify the bug.
At least two such bugs have been identified.
Add some notes about the conflict between LOGICAL_INO and balance,
and the recently added bees workaround.
Update the gotchas page for balances to point to the kernel bugs page.
Remove "bees and the full balance will both work correctly" as that
statement is not true.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This avoids some kernel bugs. One of them is fixed in 5.3.4 and later:
efad8a853a "Btrfs: fix use-after-free when using the tree modification log"
There are apparently others in current kernels, so for now just put bees
on pause until the balance is done.
At some point we may want to provide an option to disable this
workaround; however, running bees and balance at the same time makes
neither particularly fast, so maybe we'll just leave it this way.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Saying just "This feature" at some log levels could be puzzling. Let's
remove this message; the feature has worked without problems for a year.
Signed-off-by: Kai Krakow <kai@kaishome.de>
In version 2.30 glibc added its own gettid() function. This resulted in
"error: call of overloaded ‘gettid()’ is ambiguous" because gettid()
now exists in both namespace crucible and std.
For now, use explicit references to namespace crucible. This continues
to work with new and old libc without having to test specific library
versions.
At some point, glibc gettid() will be deployed widely enough that we can
remove the crucible version entirely.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Localize the hash function in bees to a single spot to make it easier
to change later (or at runtime).
Remove some code that was using a property of CRC as an optimization.
The optimization doesn't work for other hash functions, and running the
CRC function takes more CPU time than the optimization saved.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
CityHash64 appears to be the fastest available block hashing algorithm
that is good enough for dedupe. It takes much less CPU than the CRC64
function, and avoids hash-collision problems with file formats that use
CRC64 as an integrity check on 4K block boundaries.
Extracted from git://github.com/google/cityhash with the "CRC" hash
functions (which require Intel/AMD CPU support) removed. We don't
need those, and they introduce a new (if only theoretical) build-time
dependency.
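At the call site the change is a drop-in replacement (a sketch; the
include path is assumed):

#include "crucible/city.h"  // assumed location of the extracted header

uint64_t hash = CityHash64(reinterpret_cast<const char *>(block), len);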
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We got away with this because GCC 4.8 (and apparently every GCC prior
to 9) didn't notice or care, and because there is nothing referenced
inside the lambda function body that isn't accessible from any other
kind of function body (i.e. the capture wasn't needed at all).
GCC 9 now enforces what the C++ standard said all along: there is
no need to allow capture-default in this case, so it is not.
Fix by removing the offending capture-default.
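The class of error looks like this (illustrative):

// at namespace scope:
auto broken = [&]() { return 0; };  // GCC 9: error: non-local lambda
                                    // expression cannot have a capture-default
auto fixed  = []()  { return 0; };  // nothing needed capturing anyway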
Fixes: https://github.com/Zygo/bees/issues/112
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It is not possible to emulate extent-same by clone in a safe way.
EXTENT_SAME has been supported in btrfs since kernel 3.13, which
is much too old to contemplate running bees on.
Remove this dangerous and unused function.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We are getting a lot of exceptions when an inline extent is too large for the
TREE_SEARCH_V2 buffer. This disrupts ExtentWalker's extent boundary
search when there is an inline extent at the beginning of a file:
# fiemap foo
Log 0x0..0x1000 Phy 0x0..0x1000 Flags FIEMAP_EXTENT_NOT_ALIGNED|FIEMAP_EXTENT_DATA_INLINE
Log 0x1000..0x2000 Phy 0x7307f9000..0x7307fa000 Flags 0
Log 0x2000..0x3000 Phy 0x731078000..0x731079000 Flags 0
Log 0x3000..0x5000 Phy 0x73127d000..0x73127f000 Flags FIEMAP_EXTENT_ENCODED
Log 0x5000..0x6000 Phy 0x73137a000..0x73137b000 Flags 0
Log 0x6000..0x7000 Phy 0x731683000..0x731684000 Flags 0
Log 0x7000..0x8000 Phy 0x73224f000..0x732250000 Flags 0
Log 0x8000..0x9000 Phy 0x7323c9000..0x7323ca000 Flags 0
Log 0x9000..0xb000 Phy 0x732425000..0x732427000 Flags FIEMAP_EXTENT_ENCODED
Log 0xb000..0xc000 Phy 0x732598000..0x732599000 Flags 0
Log 0xc000..0xd000 Phy 0x7325d5000..0x7325d6000 Flags FIEMAP_EXTENT_LAST
# fiewalk foo
exception type std::system_error: BTRFS_IOC_TREE_SEARCH_V2: /tmp/foo at fs.cc:844: Value too large for defined data type
Normally crawlers simply skip over inline extents, but ExtentWalker will
seek backward from the first non-inline extent to confirm that it has
an accurate starting block for the target extent. This fails when it
encounters the first inline extent.
strace reveals that the buffer size is too small for the first extent,
as seen here:
ioctl(3, BTRFS_IOC_TREE_SEARCH_V2, {key={tree_id=258, min_objectid=78897856, max_objectid=UINT64_MAX, min_offset=0, max_offset=UINT64_MAX, min_transid=0, max_transid=UINT64_MAX, min_type=BTRFS_EXTENT_DATA_KEY, max_type=BTRFS_EXTENT_DATA_KEY, nr_items=16}, buf_size=1360} => {buf_size=1418}) = -1 EOVERFLOW (Value too large for defined data type)
Fix this by increasing the buffer size until it can handle the largest
possible object on the largest possible btrfs metadata page (65536 bytes).
BtrfsExtentWalker already has optimizations to minimize the allocation
cost, so we don't need any changes there.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some build environments (ARM? AARCH64?) do not have the fields
si_lower and si_upper in siginfo.
bees doesn't need them, so don't try to access them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Update the version ranges on the dependencies.
FIXME/TODO: start dropping early versions that don't work with current
code?
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
flushoncommit or not-flushoncommit isn't really a bees matter--it's
a sysadmin's tradeoff between reliability and performance. bees does
not affect that tradeoff because all dedupe src extents are flushed, so
bees introduces no *new* data loss risks in the noflushoncommit
case--i.e. any data that you could lose while running bees, you'd also
lose when not running bees.
Note that the converse is not true: bees might trigger flushing on
data that would not normally have been flushed with noflushoncommit,
and improve data integrity after a crash as a side-effect of dedupe
operations. The risks of noflushoncommit might be reduced by running
bees. I don't have evidence based on experimental data to support that
conclusion, so I'll just leave this possibility as a rumor in a commit
log message.
lvmcache can be moved from the "bad" list to the "good" list now.
bcache remains in the "bad" list due to some non-data-losing failures
that only seem to happen with bcache.
Add a note about CPUs with strange endianness or page sizes, as nobody
seems to have tried those.
Remove "at great cost" from the btrfs send workaround. The cost is
the cost, there is no need to editorialize.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The existence of information about known data corruption bugs should be
visible from the top-level page.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* comprehensive list of kernels with bees-triggered corruption bug fixes
* deadlock between dedupe and rename is now fixed (in some places)
* compressed data corruption is now fixed (in more places)
* btrfs send fix for one bug is now merged in 5.2-rc1, another bug remains
* retired the bcache/lvmcache bug (can't reproduce those bugs any more,
although I *can* reproduce an interesting non-destructive bcache bug)
* new minor bug entries for two harmless kernel warnings
* new entry for storm-of-soft-lockups
Fixes: https://github.com/Zygo/bees/issues/107
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This is especially useful when dynamic load management allocates more
worker threads than active tasks, so the extra threads are effectively
invisible.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This may help users understand some of the things that happen inside
bees...or it may just be horribly long and confusing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Enable much simpler Task management: each time a Task needs to be done
at least once in the future, simply invoke the run() method on the Task.
The Task will ensure that it only runs once, only appears in a queue
once, and will run again if a run request is made while the Task is
already running.
Make the queue policy a member of the Task rather than a method. This
enables Tasks to reschedule themselves, possibly on the appropriate queue
if we have more than one of those some day.
This happens to make Tasks more similar to Linux kernel workers.
This similarity is coincidental, but not undesirable.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce a mechanism to suppress exceptions which do not produce a
full stack trace for common known cases where a loop should be aborted.
Use this mechanism to suppress the infamous "FIXME" exception.
Reduce the log level to at most NOTICE, and in some cases DEBUG.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
libcrucible at one time in the distant past had to be a shared library
to force global C++ object initialization; however, this is no longer
required.
Make libcrucible static to solve various rpath and soname versioning
issues, especially when distros try (unwisely) to package the library
separately.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We stopped supporting shared hash tables a long time ago. Remove comments
describing the behavior of shared hash tables.
Add an event counter for pushing a hash to the front when it is already at
the front.
Audited the code for a bug related to bucket handling that impairs space
efficiency when the bucket size is greater than 1. Didn't find one.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Capture SIGINT and SIGTERM and shut down, preserving current completed
crawl and hash table state.
* Executing tasks are completed, queued tasks are paused.
* Crawl state is saved.
* The crawl master and crawl writeback threads are terminated.
* The task queue is flushed.
* Dirty hash table extents are flushed.
* Hash prefetch and writeback threads are terminated.
* Hash table is deallocated.
* FD caches and tmpfiles are destroyed.
* Assuming the above didn't crash or deadlock, bees exits.
The above order isn't the fastest, but it does roughly follow the
shared_ptr dependencies and avoids data races--especially those that
might lead to bees reporting an extent scanned when it was only queued
for future scanning that did not occur.
In case of a violation of expected shared_ptr dependency order,
exceptions in BeesContext child object accessor methods (i.e. roots(),
hash_table(), etc) prevent any further progress in threads that somehow
remain unexpectedly active.
Move some threads from main into BeesContext so they can be stopped
via BeesContext. The main thread now runs a loop waiting for signals.
A slow FD leak was discovered in TempFile handling. This has not been
fixed yet, but an implementation detail of the C++ runtime library makes
the leak so slow it may never be important enough to fix.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We need to replace nanosleeps with condition variables so that we
can implement BeesContext::stop. Export the time calculation from
sleep_for() into a new method called sleep_time().
If the thread executing RateLimiter::sleep_for() is interrupted, it will
no longer be able to restart, as the sleep_time() method is destructive.
This calls for further refactoring of sleep_time() into destructive
and non-destructive parts; however, there are currently no users of
sleep_for() which rely on being able to restart after being interrupted
by a signal.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a method to have TaskMaster discard any entries in its queue, terminate
all worker threads, and prevent any new Tasks from being queued.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The deadlock seems to be fixed now (if there ever was one--there certainly
were deadlocks, but matching deadlocks to root causes is non-trivial
and a number of distinct deadlock cases have been fixed in recent years).
The benchmark data is inconclusive about whether it is better to fsync or
not to fsync. A paranoia option might be useful here.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The crawl_master task had a simple atomic variable that was supposed
to prevent duplicate crawl_master tasks from ending up in the queue;
however, this had a race condition that could lead to m_task_running
being set with no crawl_master task running to clear it. This would in
turn prevent crawl_thread from scheduling any further crawl_master tasks,
and bees would eventually stop doing any more work.
A proper fix is to modify the Task class and its friends such that
Task::run() guarantees that 1) at most one instance of a Task is ever
scheduled or running at any time, and 2) if a Task is scheduled while
an instance of the Task is running, the scheduling is deferred until
after the current instance completes. This is part of a fairly large
planned change set, but it's not ready to push now.
So instead, unconditionally push a new crawl_master Task into the queue
on every poll, then silently and quickly exit if the queue is too full
or the supply of new extents is empty. Drop the scheduling-related
members of BeesRoots as they will not be needed when the proper fix lands.
Fixes: 4f0bc78a "crawl: don't block a Task waiting for new transids"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If /bin/sh is bash, the 'type' builtin produces a list of the filenames
in $PATH that match its arguments.
If /bin/sh is dash, we get errors like:
/bin/sh: 1: P:: not found
Hopefully having a build-dep on bash is not controversial.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The two files are identical except README.md links to docs/* while
index.md links to *.
A sed script can do that transformation, so use sed to do it.
This does modify a file in git, but this is necessary to make all
the Github views work consistently.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This sequence of log messages isn't clear:
crawl_master: WORKAROUND: Avoiding RO subvol 6094
crawl_master: WORKAROUND: RO root 6094
The first is from a cache miss, and appears wherever a root is opened
(dedupe or crawl). The second is skipping an entire subvol scan, and
only happens in crawl_master.
Elaborate on the second message a little.
Also use the term "root" consistently when referring to subvol tree IDs.
btrfs refers to these objects by (at least) three distinct names: tree,
subvol, and root. Using three different words for the same thing is worse
than using a single wrong word consistently to refer to the same concept.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
After weeks of testing I copied part of a change to main without copying
the rest of the change, leading to an immediate segfault on startup.
So here is the rest of the change: limit the number of
BeesContexts per process to 1. This change was discussed at
https://github.com/Zygo/bees/issues/54#issuecomment-360332529 but there
are more reasons to do it now: the candidates to replace the current
hash table format are less forgiving of sharing hash tables, and it may
even become necessary to have more than one hash table per BeesContext
instance (e.g. to keep datasum and nodatasum data separate).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
https://github.com/Zygo/bees/issues/91 describes problems encountered
when running bees on systems with many CPU cores.
Limit the computed number of threads (using --thread-factor or the
default) to a maximum of 8 (i.e. the number of logical cores in a modern
laptop). Users can override the limit by using --thread-count.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
options.md was a disorganized mess that markdown couldn't parse properly.
Break the options list down into sections by theme. Add the new
'--workaround-btrfs-send' option to the new 'Workarounds' section.
Clean up the rest of the text and fix some inconsistencies.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce --workaround options which trade performance or effectiveness to
avoid triggering kernel bugs.
The first such option is --workaround-btrfs-send, which avoids making any
modification to read-only subvols to avoid btrfs send bugs.
Clean up usage message: no tabs for formatting, split options into
sections by theme.
Make scan mode a non-static data member like all (most?) other options.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We didn't take enough care to fix all invocations of git in this
scenario.
Fixes: 32d2739 ("Makefile: Specify version when building from tarball")
Signed-off-by: Kai Krakow <kai@kaishome.de>
The log message is quite CPU-intensive to generate, and some data sets
have enough hash collisions to throw off benchmarks.
Keep the event counter but drop the log message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make sure the result set is empty before running the ioctl in case
something tries to consume the result without checking the error status.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we are not zero-filling containers then the overhead of allocating them
on each use is negligible. The effect that the thread_local containers
were having on RAM usage was very non-negligible.
Use dynamic containers (members or stack objects) for better control
of object lifetimes and much lower peak RAM usage. They're a tiny bit
faster, too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit brings back -O3 but in an overridable way. This should make
downstream distributions happy enough to accept it.
While we're on the subject, let's apply the same fixup logic to LDFLAGS, too.
This commit also properly gets rid of the implicit rules which collided
too easily with the depends.mk.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Automatically fall back to LOGICAL_INO if LOGICAL_INO_V2 fails and no
_V2 flags are used.
Add methods to set the flags argument with build portability to older
headers.
Use thread_local storage for the somewhat large buffers used by
LOGICAL_INO_V2 (and other users of BtrfsDataContainer like INO_PATHS).
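A sketch of the fallback path (the ioctl names and struct fields are the
real kernel API; inputs and error handling are illustrative):

#include <linux/btrfs.h>
#include <sys/ioctl.h>
#include <cerrno>
#include <cstdint>

struct btrfs_ioctl_logical_ino_args args = {};
args.logical = extent_bytenr;                    // placeholder input
args.size    = container_bytes;                  // BtrfsDataContainer size
args.inodes  = reinterpret_cast<uintptr_t>(container_ptr);
if (ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, &args) < 0 && errno == ENOTTY) {
    // kernel too old for _V2: retry, valid only when no _V2 flags are set
    ioctl(fd, BTRFS_IOC_LOGICAL_INO, &args);
}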
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Better toxic extent detection means we can now handle extents with
many more references--easily hundreds of thousands.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Replace CPU shares and IO block weight with CPU weight and IO weight.
Note that the new parameters are roughly 1/100 of the old ones--I believe
that's the right conversion. Also remove the duplicate Nice parameter
and alphabetize the parameters for ease of reading.
Faster and more reliable toxic extent detection means we can now be much
less paranoid about creating toxic extents.
The paranoia has significant impact on dedupe hit rates because every
extent that contains even one toxic hash is abandoned. The preloaded
toxic hashes were chosen because they occur more frequently than any
other block contents in typical filesystem data. The combination of these
resulted in as much as 30% of duplicate extents being left untouched.
Remove the preloaded toxic extent blacklist, and rely on the new
kernel-CPU-usage-based workaround instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Leave AL16M defined in beesd to avoid breaking scripts based on
beesd.conf.sample which used this constant.
Use the absolute size in beesd.conf.sample to avoid any future problems.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We detect toxic extents by measuring how long the LOGICAL_INO ioctl takes
to run. If it is above some threshold, we consider the extent toxic,
and blacklist it; otherwise, we process the extent normally.
The detector was using the execution time of the ioctl, which detects
toxic extents, but it also detects pauses of the bees process and
transaction commit latency due to load. This leads to a significant
number of false positives. The detection threshold was also very long,
burning a lot of kernel CPU before the detection was triggered.
Use the per-thread system CPU statistics to measure the kernel CPU usage
of the LOGICAL_INO call directly. This is much more reliable because it
is not confounded by other threads, and it's faster because we can set
the time threshold two orders of magnitude lower.
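A sketch of the measurement (RUSAGE_THREAD is the real Linux interface;
the threshold shown is illustrative):

#include <sys/resource.h>  // RUSAGE_THREAD needs _GNU_SOURCE

struct rusage before, after;
getrusage(RUSAGE_THREAD, &before);
ioctl(fd, BTRFS_IOC_LOGICAL_INO, &args);
getrusage(RUSAGE_THREAD, &after);
// system CPU consumed by this thread during the ioctl alone
double sys_cpu =
    (after.ru_stime.tv_sec  - before.ru_stime.tv_sec) +
    (after.ru_stime.tv_usec - before.ru_stime.tv_usec) * 1e-6;
bool toxic = sys_cpu > 0.1;  // threshold value is illustrative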
Also remove the lock and mutex added in "context: serialize LOGICAL_INO
calls" because we theoretically no longer need it (but leave the code
there with #if 0 in case we do need it in practice).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
ROOT_TREE contains the ROOT_ITEM for EXTENT_TREE. Every modification
(that we care about) to a btrfs must go through EXTENT_TREE, and must
modify the page in ROOT_TREE pointing to the root of EXTENT_TREE...
which makes that a very good source for the filesystem transid.
Remove the loop and the root lookups, and just look at one item for
max_transid.
Also note that every caller of transid_max_nocache() immediately
feeds the return value to m_transid_re.update(), so don't do that
inside transid_max_nocache().
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It turns out that we do need to scan all the subvols in order
to find transid_max.
Keep the bug fix though.
This reverts commit bf6ae80eee.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesRoots::transid_max_nocache calls btrfs_get_root_transid() which
retrieves the transid of the root of the given Fd. Since the FS_TREE
(subvol 5) is the root of the subvol hierarchy, it will always have
the highest transid on the filesystem, and we do not need to look at
any others.
Also fix a bug where we pass BTRFS_FS_TREE_OBJECTID instead of the
file descriptor root_fd() to btrfs_get_root_transid(). If BEESHOME
is somewhere on the same btrfs filesystem, and there are no leaked FDs
at bees startup, then BTRFS_FS_TREE_OBJECTID (5) usually has the same
integer value as a valid file descriptor of some object on the filesystem
that has a regularly increasing transid value. If Fd 5 happens to be a
file in BEESHOME then bees itself drives the transid increments. This,
combined with the search of all subvol roots, hides the bug (unless Fd
5 gets closed somehow).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesContext::home_fd() is supposed to open $BEESHOME once and cache
the Fd for later calls; however, instead it was reopening a new Fd each
time it was called, and _also_ holding that Fd in a BeesContext member.
Fds clean themselves up when they are forgotten, so it was not leaking
per se, but it certainly had more open Fds than it needed to.
Check to see if we have m_home_fd open, and return that if so.
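The shape of the fix (a sketch; the helper names are assumed from the
commit text, not verified against the source):

Fd BeesContext::home_fd()
{
    if (!!m_home_fd)
        return m_home_fd;  // reuse the Fd we already opened
    m_home_fd = open_or_die(getenv("BEESHOME"), FLAGS_OPEN_DIR);
    return m_home_fd;
}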
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
LOGICAL_INO can trip over the btrfs slow-backrefs bug, resulting in
some very long in-kernel runtimes. If too many threads are executing
LOGICAL_INO then there may be no cores left on the system to run other
tasks.
Toxic extent detection is done by a very rudimentary algorithm which
can be confused by unrelated sources of latency within btrfs (especially
commit latency). The algorithm can also be confused by other threads
executing the LOGICAL_INO ioctl.
These are two good reasons to prevent any two threads in a single bees
process instance from executing LOGICAL_INO at the same time, so let's
do that.
It is possible to limit the number of threads executing LOGICAL_INO with
the -c and -C options; however, this also limits the number of threads
which can perform any operation, while only LOGICAL_INO (*) has such a
profound effect on the rest of system operation.
Also make the status message clearer about exactly when LOGICAL_INO is
executed, as opposed to merely waiting to acquire a lock before executing
the ioctl.
(*) or maybe FILE_EXTENT_SAME. The problem function that keeps showing
up in kernel stack traces is find_parent_nodes, which is called by both
the LOGICAL_INO and FILE_EXTENT_SAME ioctls. We'll try this change
first and see if it prevents any recurrences of forced watchdog reboots;
if it does not, then we'll limit FILE_EXTENT_SAME the same way.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The ordering function for BeesCrawlState did not consider
root 292 inode 0 min_transid 2345 max_transid 3456
to be larger than
root 292 inode 258 min_transid 2345 max_transid 2345
so when we attempted to update the end pointer for the crawl progress,
the new state was not considered newer than the old state because the
min_transid was equal, but the new crawl state's inode number was smaller.
Normally this is not a problem because subvol scans typically begin
and end in separate transactions (in part because we don't start a
subvol scan until at least two transactions are available); however,
the cleanup code for the aftermath of the recent transid_min() bug can
create crawlers with equal max_transid and min_transid records.
Fix this by ordering both transid fields before any others in the
crawl state.
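i.e. something like this (member names are illustrative):

#include <tuple>

bool operator<(const BeesCrawlState &a, const BeesCrawlState &b)
{
    // transids compare first, so a crawl that finished (and bumped its
    // transids) orders after any position within the previous pass
    return std::tie(a.m_min_transid, a.m_max_transid, a.m_objectid, a.m_offset)
         < std::tie(b.m_min_transid, b.m_max_transid, b.m_objectid, b.m_offset);
}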
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Due to an earlier bug some beescrawl.dat files will contain uint64_t
max as max_transid. This prevents any further scanning on the subvol
because there is no possibility of having a real transid (or any other
uint64_t number) larger than uint64_t max.
If we detect a bad transid in beescrawl.dat, log a warning, then use
some more plausible value: either min_transid to repeat the previous
incremental crawl, or 0 to restart the subvol scan from the beginning.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On a few test machines max_transid on subvols is getting set to
18446744073709551615 (aka uint64_t max).
Prevent transid_min() from ever returning this value.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"saved" is used only during hash table correctness analysis, which is
normally not enabled at compile time, and requires source modification
to enable.
Remove the pointless copy and save a tiny bit of CPU.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The 16MB hash table extent size did not serve any useful defragmentation
or compression purpose, and for very small filesystems (under 100GB),
16MB is much larger than necessary.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
systemd-coredumpctl collects core files for later analysis
with gdb. It's a convenient thing if the keys you use to encrypt
/var/lib/systemd/coredump are the same as the keys you use to encrypt
the filesystem where you're running bees.
Add it to the documentation just before the hand-rolled version.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Standard crash backtrace collection, plus $BEESSTATUS for the high-level
overview of what bees is doing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split the rather large README into smaller sections with a pitch and
a ToC at the top.
Move the sections into docs/ so that Github Pages can read them.
'make doc' produces a local HTML tree.
Update the kernel bugs and gotchas list.
Add some information that has been accumulating in Github comments.
Remove information about bugs in kernels earlier than 4.14.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When package maintainers build from a tarball, the .git directory does
not exist to extract the version tag. Let's add a hack to work around
this issue and let them specify `BEES_VERSION="v0.y"` on the make
cmdline.
Github-Bug: https://github.com/Zygo/bees/issues/75
Signed-off-by: Kai Krakow <kai@kaishome.de>
Gentoo has officially merged the ebuild into portage as of:
https://github.com/gentoo/gentoo/pull/9925
Let's update the readme and get rid of the `contrib/gentoo-bees`
directory, so we have no potentially outdated information in the future.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Now that the packaging preparations were merged, we should update the
ebuild to reflect the upstream master branch.
Signed-off-by: Kai Krakow <kai@kaishome.de>
ExtentWalker doesn't gain significant benefits from caching, and the
extra SEARCH_V2 ioctls were blamed for a 33% kernel CPU overhead by perf.
Reduce the number of extents to 16 in lieu of fixing the caching.
This gives a significant speed boost on CPU-bound workloads compared
to the original 1024--almost 40% faster on a single SSD with a filesystem
consisting of raw VM images mounted with compress=zstd.
This also seems to reduce LOGICAL_INO overhead. Perhaps SEARCH_V2 and
LOGICAL_INO were trying to lock the same extents, and interfering with
each other?
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
`grep -q something | grep -q something_else` will never find anything:
with -q, the first grep prints nothing, so the second grep always reads
empty input.
The for-loop is redundant anyway because `grep -l` can already do the
work for us. Let's replace this with a shorter and working version.
CC: Timofey Titovets <timofey.titovets@synesis.ru>
(fixes: commit 06d41fd "Rewrite beesd arg parser")
Signed-off-by: Kai Krakow <kai@kaishome.de>
The -g option limits the number of worker threads when the target load
average is exceeded. On some systems the load normally runs high, and
continuous bees operation is required to avoid running out of disk space.
Add a -G/--thread-min option to force at least some threads to continue
running.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The task queue may already be full of tasks when the crawl task is
executed. In this case simply reschedule the crawl task at the
end of the current queue.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add -g / --loadavg-target parameter to track system load and add or
remove bees worker threads dynamically to keep system load close to the
loadavg target. Thread count may vary from zero to the maximum
specified by -c or -C, and is adjusted every 5 seconds.
This is better than implementing a similar load average scheme from
outside of the process (though that is still possible) because the
in-process load tracker does not disrupt the performance timing feedback
mechanisms as a freezer cgroup or SIGSTOP would when controlling bees
from outside. The internal load average tracker can also adjust the
number of active threads while an external tracker can only choose from
the maximum or zero.
Also fix a bug where a Task could deadlock waiting for itself to exit
if it tries to insert a new Task after the number of worker threads has
been set to zero.
Also correct usage message for --scan-mode (values are 0..2) since
we are touching adjacent lines anyway.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Other btrfs utils use readahead(), not posix_fadvise().
There does not appear to be a performance or correctness difference
between the three (none, posix_fadvise, or readahead()).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Log messages were already labelled with log levels, but there was no
way to filter by log level at run time.
Implement the filter inside the bees process so it can skip evaluation
of the BEESLOG* arguments if the log messages would not be emitted.
Fixes: https://github.com/Zygo/bees/issues/67
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When BEESLOGINFO is called multiple times it generates separate log
records that can be mixed up when multiple threads dedup.
Use a single BEESLOGINFO call for each dedup to prevent this.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
set() was broken and redundant. Calling hold() and discarding the
returned object has the correct effect.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit squashes all the little changes from the previous
integration branch into one, adjusts to the new Makefile changes, and
introduces an overlay layout so that the contrib/gentoo-bees subtree
can be directly added as a Portage overlay to the system.
The following list contains the previous commit descriptions:
sys-fs/bees: Keyword tested architecture ~amd64
Bees was tested on this platform.
sys-fs/bees: Add kernel version checks
Add checking the kernel versions and write some info and/or warnings
before building and installing the package. Running bees on older
kernels may have some serious performance and stability impacts, let's
tell the user about it.
Closes #55
sys-fs/bees: Add metadata.xml
sys-fs/bees: There's no configure script
So, there's no point in calling "default".
sys-fs/bees: Simplify src_configure()
sys-fs/bees: Don't depend on markdown
It makes no sense to install both README.md and README.html, and we can
get rid of one dependency.
Dependencies: btrfs-progs is no longer a buildtime-only dep
It is actually needed by the bees service wrapper script, as pointed out
by Gentoo QA review.
sys-fs/bees: DOCS is not needed
"COPYING" is already covered by the licensing. The ebuild defaults
already include README*
sys-fs/bees: Make warnings exclusive
It was recommended by Gentoo QA to show only either one or another
warning, and change the texts accordingly.
sys-fs/bees: RDEPEND is not implicit
RDEPEND does not implicitly default to DEPEND. Let's explicitly set the
variable.
sys-fs/bees: IUSE=test is only needed for explicit dependencies
Thus, remove it.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Make life easier for package maintainers by not forcing architecture or
compiler optimizations by default. E.g., Gentoo QA refuses to accept
both "-march=native" and "-O3". These are usually provided by the
package tooling.
Instead, we provide easily accessible templates in "makeflags".
Signed-off-by: Kai Krakow <kai@kaishome.de>
This forces us to depend on markdown which would be otherwise optional.
Most of the time it is sufficient to let package managers just install
the README.md file.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Due to VPATH and how make resolves source paths, libcrucible.so ends up
with a hard-coded path to link against libuuid.so. Let's fix it by
turning the general rule into an explicit rule for libcrucible.so.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Since systemd prefixes its own timestamps, we can unconditionally remove
timestamps when bees is executed by systemd.
Signed-off-by: Kai Krakow <kai@kaishome.de>
We should probably not put it into the objects list. Let's instead
explicitly put it as a depend of libcrucible.so.
This allows us to not use *.cc as a depend for .version.cc which makes
more sense as CRUCIBLE_OBJS is also explicitly defined and not built
from wildcards.
Signed-off-by: Kai Krakow <kai@kaishome.de>
This commit adds support for putting package configuration options into
header files. This is needed to prepare reading config files from /etc.
Signed-off-by: Kai Krakow <kai@kaishome.de>
This commit removes USR_PREFIX and introduces ETC_PREFIX instead. The
purpose of PREFIX is the installation prefix in the system, not the
installation destination. The latter one is what DESTDIR is used for.
This should clear up the confusion. PREFIX was already mis-used as
installation destination. But that doesn't mix well with how the make
targets are designed.
CC: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: Kai Krakow <kai@kaishome.de>
There's now a new make target called "install_tools" which would not run
by default on installation.
One can add "OPTIONAL_INSTALL_TARGETS=install_tools" into localconf to
install these by default.
fiewalk is installed to sbin, as only root can run it; the other tool
goes to bin.
Gentoo can use this to optionally install these tools as a package
feature.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Instead, introduce "make reallyall" and make it the default target. Now,
one can override the default target using localconf.
Needed for preparing Gentoo ebuild test behavior.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Also split "bad feature interactions" into "unknown" (which is what it
really was before) and "bad" (which includes some filesystem-destroying
problems).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Linux kernel 4.14, while resistant to extent toxicity, is not immune to it.
Go back to the paranoid setting to avoid tying up filesystems in
ridiculously long kernel loops in find_parent_nodes.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
An empty BeesBlockData from the chasing algorithm used to mean that data
was found at the expected location but it does not match; however, there
are now other reasons for this and they occur much more often. The name
is misleading.
Change the name to report more correctly what happens: no data, without
any guess about the reason.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The task queue can become very large with many subvols, requiring hours
for the queue to clear. 'beescrawl.dat' saves in the meantime will save
the work currently scheduled, not the work currently completed.
Fix by tracking progress with ProgressTracker. ProgressTracker::begin()
gives the last completed crawl position. ProgressTracker::end() gives
the last scheduled crawl position. begin() does not advance while any
item between begin() and end() is not yet completed. In between
are crawled extents that are on the task queue but not yet processed.
The file 'beescrawl.dat' saves the begin() position while the extent
scanning task queue is fed from the end() position.
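A minimal sketch of this bookkeeping, using a plain integer position in
place of the real BeesCrawlState (names here are illustrative, not the
actual ProgressTracker interface):

#include <cstdint>
#include <map>
#include <mutex>

struct ProgressSketch {
	std::mutex m_mutex;
	uint64_t m_begin = 0;              // last completed crawl position
	uint64_t m_end = 0;                // last scheduled crawl position
	std::map<uint64_t, bool> m_hold;   // in-flight position -> completed?

	void hold(uint64_t pos) {          // work at pos enters the task queue
		std::lock_guard<std::mutex> lock(m_mutex);
		m_hold[pos] = false;
		m_end = pos;
	}
	void release(uint64_t pos) {       // work at pos is completed
		std::lock_guard<std::mutex> lock(m_mutex);
		m_hold[pos] = true;
		// begin() advances only over a contiguous prefix of completed items
		while (!m_hold.empty() && m_hold.begin()->second) {
			m_begin = m_hold.begin()->first;
			m_hold.erase(m_hold.begin());
		}
	}
	uint64_t begin() { std::lock_guard<std::mutex> lock(m_mutex); return m_begin; }
	uint64_t end()   { std::lock_guard<std::mutex> lock(m_mutex); return m_end; }
};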
Also remove an unused method crawl_state_get() and repurpose the
operator<(BeesCrawlState) that nobody was using.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When both block candidates for dedup are located in the same extent, bees
excludes them from deduplication because the dedup operation would not
free any space (both blocks are still referenced, so neither is deleted).
Candidates in other extents are still considered.
Typically a few blocks are duplicated many thousands or even millions
of times within a filesystem. Many of these blocks appear in the same
extent as each other. In cases where an extent contains an extremely
common duplicate block, it may appear multiple times in many extents.
bees can get into a loop with a very bad worst-case running time: 32768
blocks per extent * 2560 bees reference limit * 256 distinct hash table
entries = 21.5 *billion* iterations...squared, because this loop happens
every time bees encounters any of the references. Not an infinite
number, but close enough.
In each iteration of the loop, replace_dst detects that both src and dst
block are part of the same btrfs extent data item and therefore should
not be deduped; however, this occurs after the block has been allocated
and read by chase_extent_ref. This dst is discarded, but the outer
loop tries again with another reference to the same block and gets the
same result.
An easy fix for this problem is to stop the loop immediately when the
same physical extent is found in both src and dst. The condition is rare
enough to ignore the negligible space efficiency loss, and filesystem
scan stops dead if the loop is allowed to proceed. An exception is
thrown to terminate the loop at scan_one_extent from within replace_dst.
It would be better to determine the extent bytenr of each candidate
extent and filter them out in scan_one_extent (which reduces the number
of LOGICAL_INO calls as a side-effect), but bees has no code capable of
doing extent data tree lookups with backward iteration yet. Even better
would be to change the hash table format so that the extent bytenr can
be decoded directly from the hash table entry (this already exists for
compressed extents). Both of these changes are too large for v0.6.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Clearing the FD cache could trigger a lot of inode evicts in the kernel,
which will block the cache entry destructors called by map::clear().
This prevents any cache lookups or new file opens while it happens.
Move the map to an auto variable and destroy it after releasing the
mutex lock. This probably has the same net result (all the bees threads
will be blocked in the kernel instead of on a bees mutex), but at least
the problem is outside of userspace now.
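A minimal sketch of the swap-under-lock pattern, assuming a simple map
in place of the real FD cache:

#include <map>
#include <mutex>

struct FdCacheSketch {
	std::mutex m_mutex;
	std::map<int, int> m_map;    // stand-in for the FD cache map

	void clear() {
		std::map<int, int> victim;
		{
			std::unique_lock<std::mutex> lock(m_mutex);
			victim.swap(m_map);  // cache is now empty; lock released next
		}
		// victim is destroyed here, after the lock is released, so slow
		// destructor work (e.g. kernel inode evicts on close()) no longer
		// blocks cache lookups or new file opens
	}
};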
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
One very common case is losing a race to open a file that was deleted.
No need to spam the logs with mere ENOENT reports.
Other errors are more significant. Log those with errno, and
add event counters to record them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Extents that extend past EOF will have ipos = (file size rounded up
to next block) and e.end() = (file size not rounded), which fails this
constraint check.
The constraint check is wrong. Remove it for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The previous commit had both max_transid assignments commented out.
It happens to work because we set max_transid in the constructor and
it doesn't change after that, but it's cleaner to assign it explicitly.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When an extent ref is modified, all of the refs in the same metadata
page get the same transid in the TREE_SEARCH_V2 header. This causes
two problems:
- Extents with generation < min_transid are included if they
happen to be referenced by pages with generation >= min_transid.
- Extent refs with generation > max_transid are excluded even
if they reference extents with generation <= max_transid.
Both of these are wrong: the first causes some extents to be repeatedly
scanned, the second causes some extents to not be scanned at all.
Change the TREE_SEARCH_V2 parameters so that Crawl sees all extents
newer than min_transid (i.e. set max_transid to max). The TREE_SEARCH_V2
kernel logic already operates this way, i.e. it fetches every page with
transid >= min_transid and discards newer items if they are too new for
max_transid. Filter strictly by the extent reference generation field
(i.e. the copy of the extent generation that is in the extent reference).
Note this still scans extent data multiple times, but it should now
be exactly once per extent reference. A proper fix for this requires
extent-based scanning instead of extent-ref-based scanning.
Formerly commit 5a8c655fc4 "roots: filter
out obsolete extents from extent refs" which landed in the subvol-threads
branch but not master.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This adds a .txt Makefile target to create a text file which receives
the test program output. In case the test failed, it will cat the
contents and fail the target.
Execution of each test is forced, so it runs every time make is invoked;
thus no failing test is missed.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Scan the roots tree directly for roots other than 5 (the FS root), and
use btrfs_get_root_transid on root_fd for root 5. This avoids filling
up the root FD cache every time we want a new transid_max. Now the only
reason we open a subvol root FD is to open a file within the subvol.
transid_max may be the same as the FS root's transid, in which case
the search loop is not necessary. Place a counter (transid_max_miss)
to see if we ever need to look at root items. If this counter never goes
above zero, or does so very rarely, we can delete the search loop.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task should not block for extended periods of time.
Remove the RateEstimator::wait_for() in crawl_roots. When crawl_roots
runs out of data, let the last crawl_task end without rescheduling.
Schedule crawl_task again on transid polls if it was not already running.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESLOGNOTE was intended to combine BEESLOG and BEESNOTE, i.e. write a
log message and set the task status message from a single expression.
With the log levels we would now need several more variants
(BEESLOGNOTEDEBUG, BEESLOGNOTEERR...) or a parameter (BEESNOTELOG(DEBUG,
...)).
Or we give up on the idea. This combination was used only 3 times so far.
The log messages and the note message have different editorial styles.
Remove the three instances of BEESLOGNOTE, and make the BEESLOGNOTE
definition equivalent to BEESLOG at LOG_NOTICE level for consistency.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The default constructor makes it more convenient to use Task as a
class member.
The ID is useful to disambiguate Task references.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
update_monotonic does not reset the counter if a new count is smaller than
earlier counts. Useful when consuming an unsorted stream of event counts.
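A minimal sketch, assuming a bare integer counter rather than the real
crucible interface:

#include <algorithm>
#include <cstdint>

struct CounterSketch {
	uint64_t m_count = 0;
	// update() follows the counter and resets on decrease
	void update(uint64_t count) { m_count = count; }
	// update_monotonic() ignores counts smaller than any already seen,
	// so an unsorted stream of event counts cannot move it backwards
	void update_monotonic(uint64_t count) { m_count = std::max(m_count, count); }
};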
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Reword log message for discovery of new toxic extents vs. lookup of
previously known toxic extents. Also add the block data (especially
filename) to the discovery message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
No public version of bees ever created old-style compressed hash table
entries. Remove the code that supports them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a third scan mode with alternative trade-offs.
Benefits: Good sequential read performance. Avoids race conditions
described in https://github.com/Zygo/bees/issues/27. Avoids diverting
scan resources into short-lived snapshots before their long-lived
origin subvols are fully scanned.
Drawbacks: Takes the longest time of the three implemented scan-modes
to free space in extents that are shared between snapshots. Uses the
maximum amount of temporary space.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Duplicated code between the different scan modes has slowly been
becoming less and less trivial. Move the code to a method and
make both scan-modes call it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Perf was blaming more than 50% of cycles on TREE_SEARCH_V2. strace
showed 4 TREE_SEARCH_V2 calls for every pread in grow_backward().
Fix by increasing the extent fetch batch size so it is more likely
to include the desired items in the first fetch attempt.
This removes TREE_SEARCH_V2 from the top 10 list of cycle consumers.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Prealloc extent sizes were taken from the Extent object and did not
take the file size into account. If a file with a non-4K-aligned
size is preallocated, the resulting dedup fails with an exception
because the size of both ranges of the BeesRangePair do not match.
Limit the size of the replacement hole extent to not extend past the
end of the file.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Restarting scans for each transid is a bit aggressive. Scan every 10
transids for a polling rate close to the former BEES_COMMIT_INTERVAL.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
transid_max is now measured at a single point in the crawl_transid thread.
Move the Crawl deferred logic into BeesRoots so it restarts all crawls
when transid_max increases. Gets rid of some messy time arithmetic.
Change name of Crawl thread to "crawl_master" in both thread name and
log messages.
Replace "Next transid" with "Crawl started".
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The periodic cache age check was not protected by a lock, so multiple
threads may decide to concurrently clear the cache. This led to
duplicate log messages.
Fix by moving the cache expiry trigger out of FdCache and into Roots,
which knows when transids change and can perform cache clears at exactly
the time they are most relevant, i.e. after something that was deleted
becomes permanently so.
This removes the last references to BEES_COMMIT_INTERVAL, so get rid
of its definition too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Now that the polling interval is up to 30 times faster,
next_transid seems too verbose again.
Make it clearer that the interval quoted in the "Deferring..."
message is the computed transaction polling interval.
Combine "Next transid" and "Restarted crawl" into a single message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make the crawl polling interval more closely track the commit interval
on the btrfs filesystem. In the future this will provide opportunities
to do things like clear FD caches and stop crawls on deleted subvols,
but triggered by transaction commits instead of arbitrary time intervals.
Rename the "crawl" thread so it no longer has the same name as the "crawl"
task, and repurpose it for dedicated transid polling. Cancel the deletion
of crawl_thread and repurpose it to trigger new crawls and wake up the
main crawl Task when it runs out of data.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
RateEstimator estimates the rate of external events by sampling a
counter.
Conversion functions are provided to predict the time when the
event counter will be incremented to particular values based on past
observations of the event counter.
Synchronization functions are provided to block a thread until a specific
counter value is reached.
Event polling is supported using the history of previous event counts
to determine the predicted time of the next event. A decay function
emphasizes more recent event history.
Polling delays are bounded by minimum and maximum values in the constructor
parameters.
wait_for() and wait_until() block the calling thread until the target
event count is reached (or the counter is reset). These functions are
not bounded by min_delay or max_delay, and require a separate thread
to call update(). wait_for() waits for the counter to be incremented
from its current value by the given count. wait_until() waits for the
counter to reach an absolute value.
update() counts external events and unblocks threads that are blocked
in wait_for() or wait_until(). If the event counter decreases then it
is reset to the new value.
duration() and time_point() convert relative and absolute event counts
into relative and absolute C++11 time quantities based on the last update
time, last observed event count, and the observed event rate.
Convenience functions seconds_for() and seconds_until() calculate
polling delays for the desired relative and absolute event counts
respectively. These delays are bounded by max and min delay parameters.
rate() and ratio() provide conversion factors based on the current
estimated event rate.
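A minimal sketch of the conversion and clamping math described above,
assuming the decayed rate has already been computed (the real
RateEstimator derives it from the counter history):

#include <algorithm>
#include <chrono>
#include <cstdint>

struct RateSketch {
	double m_rate;       // events per second, from decayed history
	double m_min_delay;  // bounds apply to polling delays only
	double m_max_delay;

	RateSketch(double rate, double min_d, double max_d)
	: m_rate(rate), m_min_delay(min_d), m_max_delay(max_d) {}

	// duration(): relative event count -> predicted elapsed time (unbounded)
	std::chrono::duration<double> duration(uint64_t relative_count) const {
		return std::chrono::duration<double>(relative_count / m_rate);
	}
	// seconds_for(): same conversion, clamped to [min_delay, max_delay]
	double seconds_for(uint64_t relative_count) const {
		return std::min(m_max_delay, std::max(m_min_delay, relative_count / m_rate));
	}
};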
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Fix discussion of nodatasum files, clarifying what we can and cannot do.
Get rid of some BEESNOTE and BEESTRACE calls which cannot be observed
(well, BEESNOTE can, but you have to be quick!).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Having too many "write a message to the log" primitives is confusing,
and having one that intermittently and silently discards output is even
_more_ confusing.
Replace all BEESINFO with appropriate BEESLOG*s. Usually DEBUG.
Except for one or two that occur too often. Just delete those.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add the new WARN_ON bug in v4.14.
Clarify what happens when bees is run on a kernel that is too old.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The data field of BeesBlockData is only interesting to those who want
to debug the BeesBlockData implementation or other battle-tested parts
of bees. Users who want to do this can modify and rebuild the source
to enable the output.
To everyone else, the data field is a huge, ongoing infoleak through
the log.
Don't bother with an option, just output the length of the data field
and nothing else.
Fixes: https://github.com/Zygo/bees/issues/53
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we are now unconditionally rendering the print_fn as a static
string, there is no need for it to be a function. We also need it to
be brief and mostly constant.
Use a string instead. Put the string before the function in the Task
constructor arguments so that the title string appears as a heading in
code, since we are making a breaking API change already.
Drop TASK_MACRO as it is broken by this change, but there is no similar
usage of Task anywhere to make it worth fixing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Move pthread_setname_np to the same place we do pthread_getname_np.
Detect errors in pthread_getname_np--but don't throw an exception
because we would call ourselves recursively from the exception handler
when it tries to log the exception.
Fix the order of set_name and the first BEESNOTE/BEESLOG call in threads,
closing small time intervals where logs have the wrong thread name,
and that wrong name becomes persistent for the thread.
Make the main thread's name "bees" because Linux kernel stack traces use
the pthread name of the main thread instead of the name of the process.
Anonymous threads get the process name (usually "bees"). We should not
have any such threads, but we do. This appears to occur mostly during
exception stack unwinding. GCC/pthread bug?
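A hedged sketch of the pattern, assuming the GNU pthread_getname_np and
pthread_setname_np extensions; names are illustrative, not the bees
implementation:

#include <pthread.h>
#include <cstring>
#include <iostream>
#include <string>

// Errors from pthread_getname_np are logged, never thrown: the top-level
// exception handler logs, and logging fetches the thread name, so a
// throw here would recurse.
static std::string get_thread_name() {
	char buf[64] = { 0 };
	int rv = pthread_getname_np(pthread_self(), buf, sizeof(buf));
	if (rv) {
		std::cerr << "pthread_getname_np: " << std::strerror(rv) << std::endl;
		return "(unknown)";
	}
	return buf;
}

static void set_thread_name(const std::string &name) {
	// the kernel limit is 15 characters plus NUL; longer names fail
	pthread_setname_np(pthread_self(), name.substr(0, 15).c_str());
}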
Fixes: https://github.com/Zygo/bees/issues/51
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tests could now be run in parallel. Additionally, single tests can be
run by simply using "make testname", i.e. "make chatter" would run the
chatter test.
Signed-off-by: Kai Krakow <kai@kaishome.de>
According to gcc docs, -l is converted to a filename which makes it a
filename parameter. Let's move it to the end.
Signed-off-by: Kai Krakow <kai@kaishome.de>
When timestamps are removed from logging, the current text layout shows
lines like
tid 12345 thread_name: Example log
Let's convert it to a more conforming layout:
thread_name[12345]: Example log
Signed-off-by: Kai Krakow <kai@kaishome.de>
When a Task worker thread is executing a Task, the thread name is less
useful than the Task description.
Use the Task description instead of the thread name if the thread has
no BeesThread name and the thread is currently executing a task.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Threads from the Task module in libcrucible don't set BeesNote::tl_name.
Even if they did, in Task context the thread name is unspecific to the point
of meaninglessness.
Use the Task::print method as the name for such threads, and be sure
that future Task print functions are designed for that usage.
The extra complexity in BeesNote::get_name() seems preferable to
bombarding pthread_setname_np hundreds or thousands of times per second.
FIXME: we are now calling Task::print() on every BeesNote, which
is effectively unconditionally. Maybe we should have Task::print()
and get_name() return a closure, or just evaluate Task::print() once
and cache it in TaskState, or define Task's constructor with a string
argument instead of the current print_fn closure.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This enables bees' thread introspection to use task descriptions in
status and log messages.
BeesNote will be calling Task::current_task() from non-Task contexts,
which means we need to allow Task's shared state pointer to be null.
Remove some asserts that will ruin our day in that case.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Silence the three(!) log messages per crawl increment and the extra one
at the end of the subvol.
The three critical messages per subvol crawl cycle are:
Next transid in BeesCrawlState <SUBVOL>:0 offset 0x0 transid <A>..<B> started <T> (<AGO>s ago)
Subvol has been completely scanned and a new transaction range will
be created. CrawlState is the state of the old subvol.
Restarted crawl BeesCrawlState <SUBVOL>:0 offset 0x0 transid <B>..<C> started <T+AGO> (0s ago)
Subvol has been restarted. CrawlState is the state of the new subvol.
Deferring next transid in BeesCrawlState <SUBVOL>:0 offset 0x0 transid <B>..<C> started <T+AGO> (0s ago)
Subvol has been completely scanned, but it is too soon to start a
new scan.
Fix the "Restart..." message to use the correct verb tense and to use
the correct BeesCrawlState data.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When we find a matching block we attempt to extend ("grow") the matched
pair around the first matching block. This function takes the IO hit of
reading the second extent from each duplicate extent pair. It's also
very slow--too many allocations, too small reads, reads in the wrong
order, an order of magnitude too many calls to TREE_SEARCH_V2, and it
is usually in the top 3 most frequent PERFORMANCE warnings.
Start tracking the running time of grows using the pairforward_ms
and pairbackward_ms counters so that we can compare it to various
replacements.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit adds log levels to the output. In systemd, it makes colored
lines, otherwise it's probably just a number. Bees is very chatty, so
this paves the road for log level filtering.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Dependencies can be generated in parallel, which can be much faster. It
also avoids the problem that the for loop may fail partway through,
leaving behind a broken intermediate file which would be picked up by
successive runs.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Let's generalize the depends.mk target so we can easily move files
around later. While doing it, let's also fix the "gcc -M" call to use
explicit target names and not clobber it with preprocessor output.
Signed-off-by: Kai Krakow <kai@kaishome.de>
We can remove the explicit depend on the .h file because that is covered
by depends.mk. Let's instead depend on makeflags which makes more sense.
Signed-off-by: Kai Krakow <kai@kaishome.de>
We need a better cache expiration algorithm than "make a copy of
the entire thing, sort it while holding a lock, and delete half
the items in a single burst."
Replace the Lamport clock with a double-linked list. Each insert
or lookup operation moves the affected item to the head of the list.
Each erase operation deletes one single item at the tail of the list.
Also sort out some iterator invalidation nonsense by doing erases before
inserts instead of "insert, erase, find the inserted item again because
we invalidated the found iterator during the erase."
The new implementation adds a second word-sized member to each Value
as well as a copy of the Key. Hopefully the enlarged size is not
a deal-breaker.
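A minimal sketch of the list-based expiry (std::list plus an index map;
illustrative, not the actual crucible cache code):

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

template <class Key, class Value>
struct LruSketch {
	// each stored item carries a copy of its Key so that eviction at the
	// tail can also erase the index entry (the size cost mentioned above)
	using Item = std::pair<Key, Value>;
	std::list<Item> m_list;   // head = most recently used
	std::unordered_map<Key, typename std::list<Item>::iterator> m_map;
	size_t m_max_size;

	LruSketch(size_t max_size) : m_max_size(max_size) {}

	Value *lookup(const Key &k) {
		auto found = m_map.find(k);
		if (found == m_map.end()) return nullptr;
		m_list.splice(m_list.begin(), m_list, found->second); // move to head
		return &found->second->second;
	}
	void insert(const Key &k, const Value &v) {
		auto found = m_map.find(k);
		if (found != m_map.end()) {          // erase before insert, not after
			m_list.erase(found->second);
			m_map.erase(found);
		}
		m_list.emplace_front(k, v);
		m_map[k] = m_list.begin();
		if (m_map.size() > m_max_size) {     // delete one item at the tail
			m_map.erase(m_list.back().first);
			m_list.pop_back();
		}
	}
};

splice() relinks list nodes in O(1) without invalidating the iterators
stored in the index map, which is what makes this cheaper than the old
copy-sort-and-purge expiry.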
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The mlock runs much faster, probably because the hash fetches are
doing most of the work that mlock does.
It makes bees startup latency for testing smaller, even if it takes more
time in absolute terms.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Include a brief description of the two algorithms without getting
into too much detail for an ostensibly temporary feature.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
There are two subvol scan algorithms implemented so far. The two modes
are unimaginatively named 0 and 1.
0: sorts extents by (inode, subvol, offset),
1: scans extents round-robin from all subvols.
Algorithm 0 scans references to the same extent at close to the same
time, which is good for performance; however, whenever a snapshot is
created, the scan of the entire filesystem restarts at the beginning of
the new snapshot.
Algorithm 1 makes continuous forward progress even when new snapshots
are created, but it does not benefit from caching and will force the
kernel to reread data multiple times when there are snapshots.
The algorithm can be selected at run-time using the -m or --scan-mode
option.
We can collect some field data on these before replacing them with
an extent-tree-based scanner. Alternatively, for pre-4.14 kernels,
we can keep these two modes as non-default options.
Currently these algorithms have terrible names. TODO: fix that, but
also TODO: delete all that code and do scans directly from the extent
tree instead.
Augment the scan algorithms relative to their earlier implementation by
batching multiple extents to scan from each subvol before switching to
a different subvol.
Sprinkle some BEESNOTEs on the Task objects so that they don't
disappear from the thread status output.
Adjust some timing constants to deal with the increased latency from
competing threads.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Distribute incoming extents across a thread pool for faster execution
on multi-core, multi-disk environments.
Switch extent enumeration model to scan extent refs consecutively(ish).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In both instances the code contained within (or the conditional
compilation surrounding it) is no longer controversial.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We need a mechanism for distributing work across processor cores and
disks.
Task implements a simple FIFO/LIFO queue model for executing closures.
Some locking primitives are included (mutex and barrier).
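A minimal sketch of the queue-of-closures model, FIFO only (the real
Task module adds LIFO ordering, the locking primitives, and worker
management):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct TaskQueueSketch {
	std::mutex m_mutex;
	std::condition_variable m_cond;
	std::deque<std::function<void()>> m_queue;
	bool m_stop = false;

	void push(std::function<void()> fn) {  // FIFO: append at the back
		std::unique_lock<std::mutex> lock(m_mutex);
		m_queue.push_back(std::move(fn));
		m_cond.notify_one();
	}
	void worker() {                        // one of these per core/disk
		while (true) {
			std::function<void()> fn;
			{
				std::unique_lock<std::mutex> lock(m_mutex);
				m_cond.wait(lock, [this] { return m_stop || !m_queue.empty(); });
				if (m_queue.empty()) return;
				fn = std::move(m_queue.front());
				m_queue.pop_front();
			}
			fn();                          // run the closure outside the lock
		}
	}
};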
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Bees will someday rely on features available only in kernel v4.14.
Let's start now by removing workarounds for bugs that were fixed in v4.11.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Remove some dead code because dedup-related deadlocks have not been
observed since Linux kernel v4.11.
Preserve rationale of remaining #if 0 block (why we do write/rename
instead of write/fsync/rename) so that people don't try to replace the
"missing" fsync() there.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With kernel 4.14 there is no sign of the previous LOGICAL_INO performance
problems, so there seems to be no need to throttle threads using this
ioctl.
Increase the FD cache size limits and scan thread count. Let the kernel
figure out scheduling.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BEESNOTE puts a message on the status message stack. BEESINFO logs a
message with rate limiting. The message that was flooding the logs
was coming from BEESINFO not BEESNOTE.
Fix earlier commit which removed the wrong message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We were holding weak refs until the next time the resource ID was used.
This is a bad thing if resource IDs are sparse (e.g. pointers or hashes)
because we'll never see an ID twice.
To fix, determine whether we released the last instance of a resource,
and if so, free its weak ref immediately.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The bugs in other parts of the code have been identified and fixed,
so the overprotective locks around shared_ptr can be removed.
Keep the other improvements to the Resource class.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This avoids PERFORMANCE warnings when large hash tables are used on slow
CPUs or with lots of worker threads. It also simplifies the code (no
locksets, only one object-wide mutex instead of two).
Fixed a few minor bugs along the way (e.g. we were not setting the dirty
flag on the right hash table extent when we detected hash table errors).
Simplified error handling: IO errors on the hash table are ignored,
instead of throwing an exception into the function that tried to use the
hash table.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit adds an ebuild for Gentoo. Version 9999 builds live from
current git, currently using kakra:integration because it has some
installation and build fixes important for Gentoo.
Signed-off-by: Kai Krakow <kai@kaishome.de>
According to Gentoo packaging guide, -fPIC should only be used on shared
libraries, and not added unconditionally to every linker call.
Signed-off-by: Kai Krakow <kai@kaishome.de>
In preparation for Gentoo QA checks during ebuild merge phase, let's
make some more of the filesystem layout adjustable.
Signed-off-by: Kai Krakow <kai@kaishome.de>
In Gentoo, usage of DESTDIR is automatically handled by the build system
to support installation into a clean image from which the package is
created.
Thus, let's add DESTDIR to the install targets. One can now correctly
install bees with packaging systems simply by running:
$ DESTDIR=/tmp/bees-image make all install
This no longer messes with the PREFIX setting.
CC: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: Kai Krakow <kai@kaishome.de>
If using `scripts/beesd`, we need `blkid` which is part of util-linux.
It should be available on every distribution but let's document it
anyway.
Signed-off-by: Kai Krakow <kai@kaishome.de>
It happened more than once that I ran just "make install", which doesn't
install the scripts.
Let's fix this by renaming the previous install target to install_bees,
and then make a new install target which depends on each install target
and thus installs the complete distribution.
It doesn't hurt to install those few scripts. I don't see the point in
separating the install targets as it was previously done.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Let's direct users to the support site when they ask systemd for help
about the service unit, or by looking at error messages.
Also, let's adjust the description to be more pleasing to the eyes. The
previous long description with uncommon formatting really stuck out in
the boot logs.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Starting bees right after local-fs.target is probably not what we want,
as basic setup of the system might not have been done (like udev,
cryptsetup, sysctl, swap, etc).
Let's start only after sysinit.target instead which guarantees that all
basic setup has been done, most importantly, sysctl, udev, and swap have
been setup which may apply important tweaks, configuration, and tuning.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Due to bees installing into the local-fs.target, bees also runs during
system-update.target. This should not be done: system-update.target is
meant as an isolated bootup mode for applying updates offline, that is,
only essential services are running.
Fix this by making it WantedBy basic.target instead. According to
system-update.target and "man bootup", system-update.target pulls in
sysinit.target, as does basic.target. So essentially, basic.target is
not part of the system-update.target transaction.
Signed-off-by: Kai Krakow <kai@kaishome.de>
If two utilities are found, we get commands like
/usr/bin/markdown /usr/bin/markdown_py README.md > README.html
and that doesn't work.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Read-only snapshots have always just worked. Remove them from the
"untested" list.
nodatasum (and therefore nodatacow) inodes are simply ignored. This seems
like the right thing to do since deduping a nodatacow extent turns it
into a datacow extent, which seems contrary to administrator wishes
implied by the nodatacow bit. We probably need an option to
override that assumption.
Clarify why converted ext[234] filesystems may cause problems and
the nature of those problems.
Assorted minor editorial changes.
Discuss calculation of the balance limit parameter when ensuring
sufficient metadata space.
Update kernel version bug/fix/feature lists, including LOGICAL_INO_V2.
Annotate kernel workaround list with known kernel versions that make
the workarounds necessary.
Remove reference to 'DEFRAG_RANGE' as bees requires much more control
over data placement than this interface can offer. It's easy enough to
create a new ioctl to implement bees requirements once it's known what
those requirements are.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When a toxic extent is discovered, insert the offending hash/address/toxic
entry into the hash table.
When a previously discovered toxic extent is encountered, do nothing,
i.e. allow the offending hash/address/toxic entry in the hash table
to expire.
Previously both inserts were removed from the code, but the former one
is required. The latter prevents bees from forgiving toxic extents
(or any hash matching one) should they be relocated, deleted, or simply
become non-toxic.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Interesting things happen when blindly swapping the release-build CCFLAGS
with the debug-build commented-out CCFLAGS. None of these things that
happen are good.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The current scheme leads, while packaging, to paths like:
/tmp/makepkg/bees-git/pkg/bees-git/usr/lib/bees/bees
So allow doing:
make
make scripts
make install ...
make install_scripts ...
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
GCC 7 and higher turn a previous warning into an error for implicit
fallthrough. Let's hint the compiler that this is intentional here.
Signed-off-by: Kai Krakow <kai@kaishome.de>
(cherry picked from commit 270a91cf17)
GCC 7 and higher turn a previous warning into an error for implicit
fallthrough. Let's hint the compiler that this is intentional here.
Signed-off-by: Kai Krakow <kai@kaishome.de>
GCC 7 and higher turn a previous warning into an error for implicit
fallthrough. Let's hint the compiler that this is intentional here.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Now with the patches integrated to filter logging output, we can finally
remove forking a subprocess and stop redirecting file descriptors.
We instead use exec to replace the process with the final daemon.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Use a static function instead of embedding side-effects in the constructor
of an unrelated class.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 85106bd9a9)
To make bees more friendly to use with syslog/systemd, we add an option
to omit timestamps from the log output.
Signed-off-by: Kai Krakow <kai@kaishome.de>
This commit adds a simple getopt options parser to show help. This can
be used as a boilerplate for adding more options later.
Signed-off-by: Kai Krakow <kai@kaishome.de>
To install for different distributions, LIBEXEC_PREFIX can now be set.
It defaults to $(PREFIX)/usr/lib/bees as used in most common
distributions.
Local overrides are possible by setting variables in a "localconf" file
which will be included by the Makefile if it exists.
For some distributions you may want to set it to /usr/libexec or
/usr/libexec/bees.
Let's remove the CPUQuota example and instead give bees a share of
what's available.
128 CPU shares will give it about 12% max CPU under load; give it a
slight boost during startup to allow reading the hash table faster.
100 block shares will give it about 10% max disk bandwidth under load;
again with a slight boost during startup to allow reading the hash
table faster.
Then let's adjust the CPU and IO scheduler to prefer other processes.
This way bees runs completely in the background, barely noticeable
during, e.g., gaming.
Explicitly set control-group kill mode, that is: try SIGTERM first, and
use SIGKILL after a timeout. This exactly defines how bees runs as a
child process within the frontend service starter. Not sure if bees cares
about signals but SIGTERM first seems cleaner. On the way, let bees restart
on abnormal termination.
After a few hundred subvol threads start running, the inode cache starts
to thrash, and the log gets spammed with messages of the form:
"open_root_nocache <subvolid>: <path>"
Ideally there would be some way to schedule work to minimize inode
thrashing. Until that gets done, just silence the messages for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
With many threads it is inconvenient to reassemble the elided parts of
the dedup src/dst and scan filenames output. Simply output them
unconditionally, and balance the line lengths.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we lose a race and open the wrong file, we will not retry with the
next path if the file we opened had incompatible flags. We need to keep
trying paths until we open the correct file or run out of paths.
Fix by moving the inode flag check after the checks for file identity.
Output attributes in hex to be consistent with other attribute error
messages.
There is no need to report root and file paths separately in the error
message for incompatible flags because we have confirmed the identity of
the file before the incompatible flag error is detected. Other messages
in this loop still output root path and file_path separately because
the identity of 'rv' is unknown at the time these messages are emitted.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If you have many nocow files, or a few big ones (like VM images), which
contain a lot of potential deduplication candidates, bees becomes
incredibly slow, running through a lot of "invalid operation" exceptions.
Let's just skip over such files to get more bang for the buck. I did no
regression testing as this patch seems trivial (and I cannot imagine any
pitfalls either). The process progresses much faster for me now.
Keep track of the locking thread so we can see why we are deadlocked
in gdb.
Use a handle type for locks based on shared_ptr. Change the handle type
name to flush out any non-auto local variables.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit aa0b22d445)
This helps identify causes of the "same physical address in dedup"
exception.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit cc7b4f22b5)
BLOCK_SIZE_MIN_EXTENT_DEFRAG, BLOCK_SIZE_MIN_EXTENT_SPLIT, and others
are no longer used. Remove them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit a3d7032eda)
Add time spent in file create and copy operations to the stats.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit f01c20f972)
A BEESTRACE closure could throw an exception. Trap those so we don't
end up in terminate().
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 59660cfc00)
Reads can block indefinitely due to bugs, low io priority, or poor
storage performance. Record the block origin data in the thread state
so we can see which reads are problematic.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit f56f736d28)
Use () instead of [] when the respective end of the byte range touches
the beginning or end of the file. Also omit the '0' at beginning of
file.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 3023b7f57a)
Use a different character to make it easier to search for bytenr ranges
in the logs.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit d43199e3d6)
This will allow the default size limit for cache objects to be changed
with impunity.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 9daa51edaa)
perf blames the SEARCH_V2 ioctl wrapper for a lot of time spent in malloc.
Use a thread_local buffer for ioctl results, and reuse it between runs.
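A minimal sketch of the thread_local reuse, assuming a plain byte vector
for the result buffer (not the actual ioctl wrapper):

#include <cstddef>
#include <vector>

// one buffer per thread, grown on demand and reused between calls, so
// the ioctl wrapper stops allocating a fresh result buffer per call
static std::vector<char> &result_buffer(size_t min_size) {
	thread_local std::vector<char> buf;
	if (buf.size() < min_size) buf.resize(min_size);
	return buf;
}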
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit e509210428)
GCC 7+ added the implicit-fallthrough warning.
In some places fallthrough is expected, so disable the warning there.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Holding file FDs open for long periods of time delays inode destruction.
For very large files this can lead to excessive delays while bees dedups
data that will cease to be reachable.
Use the same workaround for file FDs (in the root_ino cache) that
is used for subvols (in the root cache): forcibly close all cached
FDs at regular intervals. The FD cache will reacquire FDs from files
that still have existing paths, and will abandon FDs from files that
no longer have existing paths. The non-existing-path case is not new
(bees has always been able to discover deleted inodes) so it is already
handled by existing code.
Fixes: https://github.com/Zygo/bees/issues/18
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
check_overflow() will invalidate iterators if it decides there are too
many cache entries.
If items are deleted from the cache, search for the inserted item again
to ensure the iterator is valid.
Increase size of timestamp to size_t.
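A minimal sketch of the re-find pattern with hypothetical names:

#include <cstddef>
#include <map>
#include <utility>

struct OverflowSketch {
	std::map<int, int> m_map;
	size_t m_max_size = 100;

	bool check_overflow() {               // true if anything was deleted
		bool deleted = false;
		while (m_map.size() > m_max_size) {
			m_map.erase(m_map.begin());
			deleted = true;
		}
		return deleted;
	}
	int *insert(int key, int value) {
		auto iter = m_map.insert(std::make_pair(key, value)).first;
		if (check_overflow()) {
			// items were deleted: search for the inserted item again
			// rather than trusting the possibly-invalidated iterator
			iter = m_map.find(key);
			if (iter == m_map.end()) return nullptr;  // itself evicted
		}
		return &iter->second;
	}
};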
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some whitespace fixes. Remove some duplicate code. Don't lock
two BeesStats objects in the - operator method.
Get the locking for T& at(const K&) right to avoid locking a mutex
recursively. Make the non-const version of the function private.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Before:
unique_lock<mutex> lock(some_mutex);
// run lock.~unique_lock() because return
// return reference to unprotected heap
return foo[bar];
After:
unique_lock<mutex> lock(some_mutex);
// make copy of object on heap protected by mutex lock
auto tmp_copy = foo[bar];
// run lock.~unique_lock() because return
// pass locally allocated object to copy constructor
return tmp_copy;
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Before:
unique_lock<mutex> lock(some_mutex);
// run lock.~unique_lock() because return
// return reference to unprotected heap
return foo[bar];
After:
unique_lock<mutex> lock(some_mutex);
// make copy of object on heap protected by mutex lock
auto tmp_copy = foo[bar];
// run lock.~unique_lock() because return
// pass locally allocated object to copy constructor
return tmp_copy;
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we release the lock first (and C++ destructor order says we do), then
the return value will be constructed from data living in an unprotected
container object. That data might be destroyed before we get to the
copy constructor for the return value.
Make a temporary copy of the return value that won't be destroyed by any
other thread, then unlock the mutex, then return the copy object.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Get rid of the ResourceHolder class.
Fix GCC static template member instantiation issues.
Replace assert() with exceptions.
shared_ptr can't seem to do reference counting in a multi-threaded
environment. The code looks correct (for both ResourceHandle and
std::shared_ptr); however, continual segfaults don't lie.
Carpet-bomb with mutex locks to reduce the likelihood of losing shared_ptr
races.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Found by valgrind. It was mostly harmless because the range of
usable values is limited by m_burst (which was initialized) and 0.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"s_name" was a thread_local variable, not static, and did not require a
mutex to protect access. A deadlock is possible if a thread triggers an
exception with a handler that attempts to log a message (as the top-level
exception handler in bees does).
Remove multiple unnecessary mutex locks. Rename the thread_local variables
to make their scope clearer.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add MADV_DONTDUMP to the list of advice flags.
There are now three flags which may or may not be supported by the
target kernel. Try each one and log its success or failure separately.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The hash table statistics calculation in BeesHashTable::prefetch_loop
and the data-driven operation of the extent scanner always pulls the
hash table into RAM as fast as the disk will push the data. We never
use the prefetch rate limit, so remove it.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This fixes a bug where bees tries to process itself as a btrfs filesystem.
This is a species of bug that I only notice *after* pushing to a public
git repo.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Extend the LockSet class so that the total number of locked (active)
items can be limited. When the limit is reached, no new items can be
locked until some existing locked items are unlocked.
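A minimal sketch of a size-limited lock set (illustrative, not the
crucible::LockSet interface):

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>

template <class Key>
struct LockSetSketch {
	std::mutex m_mutex;
	std::condition_variable m_cond;
	std::set<Key> m_locked;   // currently locked (active) items
	size_t m_max_size;

	LockSetSketch(size_t max_size) : m_max_size(max_size) {}

	void lock(const Key &k) {
		std::unique_lock<std::mutex> lock(m_mutex);
		// block until k is free and the active count is under the limit
		m_cond.wait(lock, [&] {
			return !m_locked.count(k) && m_locked.size() < m_max_size;
		});
		m_locked.insert(k);
	}
	void unlock(const Key &k) {
		std::unique_lock<std::mutex> lock(m_mutex);
		m_locked.erase(k);
		m_cond.notify_all();  // waiters wait on two different conditions
	}
};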
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Every git commit was causing bees.cc and bees-hash.cc to be rebuilt,
which was expensive and unnecessary.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The btrfs LOGICAL_INO ioctl has no way to report references to compressed
blocks precisely, so we must always consider all references to a
compressed block, and discard those that do not have the desired offset.
When we encounter compressed shared extents containing a mix of unique
and duplicate data, we attempt to replace all references to the mixed
extent with the same number of references to multiple extents consisting
entirely of unique or duplicate blocks. An early exit from the loop
in BeesResolver::for_each_extent_ref was stopping this operation early,
after replacing as few as one shared reference. This left other shared
references to the unique data on the filesystem, effectively creating
new dup data.
The failing pattern looks like this:
dedup: replace 0x14000..0x18000 from some other extent
copy: 0x10000..0x14000
dedup: replace 0x10000..0x14000 with the copy
[may be multiple dedup lines due to multiple shared references]
copy: 0x18000..0x1c000
[missing dedup 0x18000..0x1c000 with the copy here]
scan: 0x10000 [++++dddd++++] 0x1c000
If the extent 0x10000..0x1c000 is shared and compressed, we will make
a copy of the extent at 0x18000..0x1c000. When we try to dedup this
copy extent, LOGICAL_INO will return a mix of references to the data
at logical 0x10000 and 0x18000 (which are both references to the
original shared extent with different offsets). If we break out
of the loop too early, we will stop as soon as a reference to 0x10000
is found, and ignore all other references to the extent we are trying
to remove.
The copy at the beginning of the extent (0x10000..0x14000) usually
works because all references to the extent cover the entire extent.
When bees performs the dedup at 0x14000..0x18000, bees itself creates
the shared references with different offsets.
Uncompressed extents were not affected because LOGICAL_INO can locate
physical blocks precisely if they reside in uncompressed extents.
This change will hurt performance when looking up old physical addresses
that belong to new data, but that is a much less urgent problem.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
btrfs provides a flush on rename when the rename target exists, so the
fsync is not necessary. In the initialization case (when the rename
target does not exist and the implicit flush does not occur), the file
may be empty or a hole after a crash. Bees treats this case the same
as if the file did not exist. Since this condition occurs for only the
first 15 minutes of the lifetime of a bees installation, it's not worth
bothering to fix.
If we attempt to fsync the file ourselves, on a crash with log replay,
btrfs will end up with a directory entry pointing to a non-existent inode.
This directory entry cannot be deleted or renamed except by deleting
the entire subvol. On large filesystems this bug is triggered by nearly
every crash (verified on kernels up to 4.5.7).
Remove the fsync to avoid the btrfs bug, and accept the failure mode
that occurs in the first 15 minutes after a bees install.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Previously, the scan order processed each subvol in order. This required
very large amounts of temporary disk space, as a full filesystem scan
was required before any shared extents could be deduped. If the hash
table RAM was underprovisioned this would mean some shared dup blocks
were removed from the hash table before they could be deduped.
Currently the scan order takes the first unscanned extent from each
subvol. This works well if--and only if--the subvols are either empty
or children of a common ancestor. It forces the same inode/offset pairs
to be read at close to the same time from each subvol.
When a new snapshot is created, this ordering diverts scanning to the
new subvol until it catches up to the existing subvols. For large
filesystems with frequent snapshot creation this means that the scanner
never reaches the end of all subvols. Each new subvol effectively
resets the current scan position for the entire filesystem to zero.
This prevents bees from ever completing the first filesystem scan.
Change the order again, so that we now read one unscanned extent from
each subvol in round-robin fashion. When a new subvol is created, we
share scan time between old and new subvols. This ensures we eventually
finish scanning initial subvols and enter the incremental scanning state.
The cost of this change is more repeated reading of shared extents at
scan time with less benefit from disk-device-level caching; however, the
only way to really fix this problem is to implement scanning on tree 2
(the btrfs extent tree) instead of the subvol trees.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The thread name has an arbitrarily limited size, and we are eventually
removing support for multiple paths in a single bees daemon process.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This gets rid of some more big memsets. It may replace them
with a lot of tiny mallocs, though. If this turns out to be
a bad idea then at least we can easily revert the change.
We really do need some large buffers for BtrfsIoctlSearchKey in some
cases, but we don't need to zero them out first. Don't do that so we
save some CPU.
Reduce the default buffer size to 4K because most BISK users don't
need much more than 1K. Set the buffer size explicitly to the product of
the number of items and the desired item size in the places that really
need a lot of items.
The current crc64 algorithm is a variant of the Redis implementation.
Change it to a variant of the Adler implementation as described
at https://matt.sh/redis-crcspeed
Test program at https://github.com/PeeJay/crc64-compare
Filesize: 1.1G
Asking crc64-redis to sum "/media/peejay/BTRFS/1/ubuntu-14.04.5-desktop-amd64.iso"...
Asking crc64-adler to sum "/media/peejay/BTRFS/1/ubuntu-14.04.5-desktop-amd64.iso"...
Redis CRC-64: f971f9ac6c8ba458
Adler CRC-64: f971f9ac6c8ba458
Adler throughput: 1659.913308 MB/s
Redis throughput: 437.284661 MB/s
Adler is 3.79x faster than Redis
Signed-off-by: Paul Jones <paul@pauljones.id.au>
It turns out we never use a value for m_buf_size that isn't the default,
and we also never ask for more than a few thousand items; however,
we do spend a ton of time memsetting the huge buffer to zero.
I don't know what the ideal size is, but 16K is a far better guess
than 1MB. Let's reduce it for some immediate CPU benefit, and determine
what the size should be later.
Reported at https://github.com/Zygo/bees/issues/11
We don't _need_ transparent hugepages. We like them because they can
be faster, but it's not a requirement, and some people will disable
transparent hugepages because they make non-Bees-like workloads slow.
Try to use MADV_HUGEPAGE, but if it fails, just log the error and
continue.
MADV_DONTFORK would be useful if we still fork()ed, but we don't currently
do that. It's still a useful flag to have because a fork() with more
than 50% of RAM in mlocked pages would result in a kernel OOM crash.
I don't think it's possible to run Bees on a kernel that does not support
the MADV_DONTFORK flag, so don't bother checking for that flag separately.
Linux kernel commit 7f8e406 ("btrfs: improve delayed refs iterations")
seems to dramatically improve LOGICAL_INO performance. Hopefully this
commit will find its way into mainline Linux soon.
This means that most of the time in Bees is now spent on block reading
(50-75%); however, there is still a big gap between block read and
the sum of everything else we are measuring with the "*_ms" counters.
This gap is about 30% of the run time, so it would be good to find out
what's in the gap.
Add ms counters around the crawl and open calls to capture where we are
spending all the time.
The experiments are over, and the results were not a success.
Having two filesystems cohabiting in the same hash table results in a
lot of false positives, each of which requires some heavy IO to resolve.
Using MAP_SHARED to share a beeshash.dat between processes results in
catastrophically bad performance.
These features were abandoned long ago, but some of the code--and even
worse, its documentation--still remains.
Bees wants a hash table false positive rate below 0.1%. With a shared
hash table the FP rate is about the same as the dedup rate. Typically
duplicate files on one filesystem are duplicate on many filesystems.
One or more of Linux VFS and the btrfs mmap(MAP_SHARED) implementation
produce extremely poor performance results. A five-order-of-magnitude
speedup was achieved by implementing paging in userspace with worker
threads. We no longer need the support code for the MAP_SHARED case.
It is still possible to run many BeesContexts in a single process,
but now the only thing contexts share is the FD cache.
Allow relative paths with BEESHOME. These paths will be relative
to the root of the dedup target filesystem.
BEESHOME is now optional. If not specified, '.beeshome' is used.
We don't try to create BEESHOME if it doesn't exist. BEESHOME might
not be on a btrfs filesystem, so we can't insist it be a subvol.
BeesHashTable can now create a beeshash.dat if the file does not already
exist. Currently the default size is one hash table extent (16MB) and
there's no way to change that (yet), so users should still create their
own hash tables for now.
The opening of the hash table is deferred (slightly) in preparation for
hash table resizing.
No doc as the feature is currently unfinished.
I got a little too enthusiastic when redacting the code, and removed some
overloaded functions bees was using. C++ silently found replacements,
and the result was a bug that prevented any data from being persisted
from the hash table.
Fixes: https://github.com/Zygo/bees/issues/7
I accidentally did a pre-push verification on a 32-bit build host.
There were a surprisingly small number of problems, so fix them.
Bees now builds on a 32-bit host. Let's not update README just yet,
though: the 32-bit ioctl support fails immediately after startup on a
64-bit kernel.
"agent" is a nice generic term for the set of things that userspace
btrfs deduplicators are. Let's call it that.
Throw out the awkward and rambling "About" text and use the announcement
from linux-btrfs instead. Terrible English writing I at am.
I'm not surprised that GCC 6 doesn't let me send an ostream ref to itself,
even inside an uninstantiated template specialization. I am a little
surprised I was trying to, and 4.9 let me get away with it.
It's 2016. auto_ptr is deprecated now.
Some files were including <vector> that don't need it any more.
https://github.com/Zygo/bees/issues/1
> **Linux kernel versions 5.1, 5.2, and 5.3 should not be used with btrfs
due to a severe regression that can lead to fatal metadata corruption.**
This issue is fixed in version 5.4.14 and later.
**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
6.6, or 6.12 with recent LTS and -stable updates.** The latest released
kernel as of this writing is 6.12.9, and the earliest supported LTS
kernel is 5.4.
Some optional bees features use kernel APIs introduced in kernel 4.15
(extent scan) and 5.6 (`openat2` support). These bees features are not
available on older kernels. Support for older kernels may be removed
in a future bees release.
bees will not run at all on kernels before 4.2 due to lack of minimal
API support.
Kernel Bug Tracking Table
-------------------------
These bugs are particularly popular among bees users, though not all are specifically relevant to bees:
| First bad kernel | Last bad kernel | Issue Description | Fixed Kernel Versions | Fix Commit
| :---: | :---: | --- | :---: | ---
| - | 4.10 | garbage inserted in read data when reading compressed inline extent followed by a hole | 3.18.89, 4.1.49, 4.4.107, 4.9.71, 4.11 and later | e1699d2d7bf6 btrfs: add missing memset while reading compressed inline extents
| - | 4.14 | spurious warnings from `fs/btrfs/backref.c` in `find_parent_nodes` | 3.16.57, 4.14.29, 4.15.12, 4.16 and later | c8195a7b1ad5 btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
| 4.15 | 4.18 | compression ratio and performance regression on bees test corpus | improved in 4.19 | 4.14 performance not fully restored yet
| - | 5.0 | silently corrupted data returned when reading compressed extents around a punched hole (bees dedupes all-zero data blocks with holes which can produce a similar effect to hole punching) | 3.16.70, 3.18.137, 4.4.177, 4.9.165, 4.14.108, 4.19.31, 5.0.4, 5.1 and later | 8e928218780e Btrfs: fix corruption reading shared and compressed extents after hole punching
| - | 5.0 | deadlock when dedupe and rename are used simultaneously on the same files | 5.0.4, 5.1 and later | 4ea748e1d2c9 Btrfs: fix deadlock between clone/dedupe and rename
| - | 5.1 | send failure or kernel crash while running send and dedupe on same snapshot at same time | 5.0.18, 5.1.4, 5.2 and later | 62d54f3a7fa2 Btrfs: fix race between send and deduplication that lead to failures and crashes
| - | 5.2 | alternating send and dedupe results in incremental send failure | 4.9.188, 4.14.137, 4.19.65, 5.2.7, 5.3 and later | b4f9a1a87a48 Btrfs: fix incremental send failure after deduplication
| 4.20 | 5.3 | balance convert to single rejected with error on 32-bit CPUs | 5.3.7, 5.4 and later | 7a54789074a5 btrfs: fix balance convert to single on 32-bit host CPUs
| - | 5.3 | kernel crash due to tree mod log issue #1 (often triggered by bees) | 3.16.79, 4.4.195, 4.9.195, 4.14.147, 4.19.77, 5.2.19, 5.3.4, 5.4 and later | efad8a853ad2 Btrfs: fix use-after-free when using the tree modification log
| - | 5.4 | kernel crash due to tree mod log issue #2 (often triggered by bees) | 3.16.83, 4.4.208, 4.9.208, 4.14.161, 4.19.92, 5.4.7, 5.5 and later | 6609fee8897a Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
| 5.1 | 5.4 | metadata corruption resulting in loss of filesystem when a write operation occurs while balance starts a new block group. **Do not use kernel 5.1 with btrfs.** Kernel 5.2 and 5.3 have workarounds that may detect corruption in progress and abort before it becomes permanent, but do not prevent corruption from occurring. Also kernel crash due to tree mod log issue #4. | 5.4.14, 5.5 and later | 6282675e6708 btrfs: relocation: fix reloc_root lifespan and access
| - | 5.4 | send performance failure when shared extents have too many references | 4.9.207, 4.14.159, 4.19.90, 5.3.17, 5.4.4, 5.5 and later | fd0ddbe25095 Btrfs: send, skip backreference walking for extents with many references
| 5.0 | 5.5 | dedupe fails to remove the last extent in a file if the file size is not a multiple of 4K | 5.4.19, 5.5.3, 5.6 and later | 831d2fa25ab8 Btrfs: make deduplication with range including the last block work
| 4.5, backported to 3.18.31, 4.1.22, 4.4.4 | 5.5 | `df` incorrectly reports 0 free space while data space is available. Triggered by changes in metadata size, including those typical of large-scale dedupe. Occurs more often starting in 5.3 and especially 5.4 | 4.4.213, 4.9.213, 4.14.170, 4.19.102, 5.4.18, 5.5.2, 5.6 and later | d55966c4279b btrfs: do not zero f_bavail if we have available space
| - | 5.5 | kernel crash due to tree mod log issue #3 (often triggered by bees) | 3.16.84, 4.4.214, 4.9.214, 4.14.171, 4.19.103, 5.4.19, 5.5.3, 5.6 and later | 7227ff4de55d Btrfs: fix race between adding and putting tree mod seq elements and nodes
| - | 5.6 | deadlock when enumerating file references to physical extent addresses while some references still exist in deleted subvols | 5.7 and later | 39dba8739c4e btrfs: do not resolve backrefs for roots that are being deleted
| - | 5.6 | deadlock when many extent reference updates are pending and available memory is low | 4.14.177, 4.19.116, 5.4.33, 5.5.18, 5.6.5, 5.7 and later | 351cbf6e4410 btrfs: use nofs allocations for running delayed items
| - | 5.6 | excessive CPU usage in `LOGICAL_INO` and `FIEMAP` ioctl and increased btrfs write latency in other processes when bees translates from extent physical address to list of referencing files and offsets. Also affects other tools like `duperemove` and `btrfs send` | 5.4.96, 5.7 and later | b25b0b871f20 btrfs: backref, use correct count to resolve normal data refs, plus 3 parent commits. Some improvements also in earlier kernels.
| - | 5.7 | filesystem becomes read-only if out of space while deleting snapshot | 4.9.238, 4.14.200, 4.19.149, 5.4.69, 5.8 and later | 7c09c03091ac btrfs: don't force read-only after error in drop snapshot
| 5.1 | 5.7 | balance, device delete, or filesystem shrink operations loop endlessly on a single block group without decreasing extent count | 5.4.54, 5.7.11, 5.8 and later | 1dae7e0e58b4 btrfs: reloc: clear DEAD\_RELOC\_TREE bit for orphan roots to prevent runaway balance
| - | 5.8 | deadlock in `TREE_SEARCH` ioctl (core component of bees filesystem scanner), followed by regression in deadlock fix | 4.4.237, 4.9.237, 4.14.199, 4.19.146, 5.4.66, 5.8.10 and later | a48b73eca4ce btrfs: fix potential deadlock in the search ioctl, 1c78544eaa46 btrfs: fix wrong address when faulting in pages in the search ioctl
| 5.7 | 5.10 | kernel crash if balance receives fatal signal e.g. Ctrl-C | 5.4.93, 5.10.11, 5.11 and later | 18d3bff411c8 btrfs: don't get an EINTR during drop_snapshot for reloc
| 5.10 | 5.10 | 20x write performance regression | 5.10.8, 5.11 and later | e076ab2a2ca7 btrfs: shrink delalloc pages instead of full inodes
| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
| - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
| - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
| 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount. Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
| 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
| - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
| - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
| 5.18 | 5.19 | parent transid verify failed during log tree replay after a crash during a rename operation | 5.18.18, 5.19.2, 6.0 and later | 723df2bcc9e1 btrfs: join running log transaction when logging new name
| 5.12 | 6.0 | space cache corruption and potential double allocations | 5.15.65, 5.19.6, 6.0 and later | ced8ecf026fd btrfs: fix space cache corruption and potential double allocations
| 6.0 | 6.5 | suboptimal allocation in multi-device filesystems due to chunk allocator regression | 6.1.60, 6.5.9, 6.6 and later | 8a540e990d7d btrfs: fix stripe length calculation for non-zoned data chunk allocation
| 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later. Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
| 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
| 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that
"Last bad kernel" refers to that version's last stable update from
kernel.org. Distro kernels may backport additional fixes. Consult
your distro's kernel support for details.
When the same version appears in both "last bad kernel" and "fixed kernel
version" columns, it means the bug appears in the `.0` release and is
fixed in the stated `.y` release. e.g. a "last bad kernel" of 5.4 and
a "fixed kernel version" of 5.4.14 has the bug in kernel versions 5.4.0
through 5.4.13 inclusive.
A "-" for "first bad kernel" indicates the bug has been present since
the relevant feature first appeared in btrfs.
A "-" for "last bad kernel" indicates the bug has not yet been fixed in
current kernels (see top of this page for which kernel version that is).
In cases where issues are fixed by commits spread out over multiple
kernel versions, "fixed kernel version" refers to the version that
contains the last committed component of the fix.
Workarounds for known kernel bugs
---------------------------------
* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**: on all
kernel versions so far, multiple threads running `LOGICAL_INO` and
dedupe/clone ioctls at the same time on the same inodes or extents
can lead to a kernel hang. The kernel enters an infinite loop in
`add_all_parents`, where `count` is 0, `ref->count` is 1, and
`btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.
bees has two workarounds for this bug: 1. schedule work so that multiple
threads do not simultaneously access the same inode or the same extent,
and 2. use a brute-force global lock within bees that prevents any
thread from running `LOGICAL_INO` while any other thread is running
dedupe.
Workaround #1 isn't really a workaround, since we want to do the same
thing for unrelated performance reasons. If multiple threads try to
perform dedupe operations on the same extent or inode, btrfs will make
all the threads wait for the same locks anyway, so it's better to have
bees find some other inode or extent to work on while waiting for btrfs
to finish.
Workaround #2 doesn't seem to be needed after implementing workaround
#1, but it's better to be slightly slower than to hang one CPU core
and the filesystem until the kernel is rebooted. The shape of this
global lock is sketched below.
It is still theoretically possible to trigger the kernel bug when
running bees at the same time as other dedupers, or other programs
that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
operation such as `cp` or `mv`; however, it's extremely difficult to
reproduce the bug without closely cooperating threads.
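A minimal sketch of workaround #2's shape (hypothetical names; bees's
real locking is finer-grained and interacts with workaround #1's
scheduling):

```cpp
#include <mutex>

// Hypothetical sketch: one process-wide mutex guarantees that
// LOGICAL_INO and dedupe ioctls never run concurrently, trading some
// parallelism for immunity to the kernel hang.
static std::mutex logical_ino_vs_dedupe_mutex;

void resolve_extent_refs(/* ... */)
{
        std::lock_guard<std::mutex> lock(logical_ino_vs_dedupe_mutex);
        // ... issue BTRFS_IOC_LOGICAL_INO here ...
}

void dedupe_extent_pair(/* ... */)
{
        std::lock_guard<std::mutex> lock(logical_ino_vs_dedupe_mutex);
        // ... issue FIDEDUPERANGE here ...
}
```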
* **Slow backrefs** (aka toxic extents): On older kernels, under certain
conditions, if the number of references to a single shared extent grows
too high, the kernel consumes more and more CPU while also holding
locks that delay write access to the filesystem. This is no longer
a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
but there are still some remains of earlier workarounds for this issue
in bees that have not been fully removed.
bees avoided this bug by measuring the time the kernel spends performing
`LOGICAL_INO` operations and permanently blacklisting any extent or
hash involved where the kernel starts to get slow. In the bees log,
such blocks are labelled as 'toxic' hash/block addresses.
Future bees releases will remove toxic extent detection (it only detects
false positives now) and clear all previously saved toxic extent bits.
* **dedupe breaks `btrfs send` in old kernels**. The bees option
`--workaround-btrfs-send` prevents any modification of read-only subvols
in order to avoid breaking `btrfs send` on kernels before 5.2.
This workaround is no longer necessary to avoid kernel crashes and
send performance failure on kernel 5.4.4 and later. bees will pause
dedupe until the send is finished on current kernels.
`btrfs receive` is not and has never been affected by this issue.
Event counters are used in bees to collect simple branch-coverage
statistics. Every time bees makes a decision, it increments an event
counter, so there are _many_ event counters.
Events are grouped by prefix in their event names, e.g. `block` is block
I/O, `dedup` is deduplication requests, `tmp` is temporary files, etc.
Events with the suffix `_ms` count total milliseconds spent performing
the operation. These are counted separately for each thread, so there
can be more than 1000 ms per second.
There is considerable overlap between some events, e.g. `example_try`
denotes an event that is counted when an action is attempted,
`example_hit` is counted when the attempt succeeds and has a desired
outcome, and `example_miss` is counted when the attempt succeeds but
the desired outcome is not achieved. In most cases `example_try =
example_hit + example_miss + (`example failed and threw an exception`)`,
but some event groups defy such simplistic equations.
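The counting pattern looks roughly like this (a hypothetical sketch;
bees's actual counter implementation differs):

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch of the try/hit/miss pattern described above.
static std::mutex counter_mutex;
static std::map<std::string, uint64_t> counters;

void count(const std::string &name, uint64_t amount = 1)
{
        std::lock_guard<std::mutex> lock(counter_mutex);
        counters[name] += amount;
        // `_ms` counters use the same mechanism, with `amount` set to
        // elapsed milliseconds, accumulated separately per thread.
}

bool attempt_the_thing() { return true; }  // hypothetical stand-in

bool example()
{
        count("example_try");
        // If attempt_the_thing() throws, neither hit nor miss is counted,
        // so example_try = example_hit + example_miss + exceptions.
        bool desired_outcome = attempt_the_thing();
        count(desired_outcome ? "example_hit" : "example_miss");
        return desired_outcome;
}
```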
addr
----
The `addr` event group consists of operations related to translating `(root,
inode, offset)` tuples (i.e. logical position within a file) into btrfs
virtual block addresses (i.e. physical position on disk). A simplified
sketch of such an address encoding follows the list.
* `addr_block`: The address of a block was computed.
* `addr_compressed`: Obsolete implementation of `addr_compressed_offset`.
* `addr_compressed_offset`: The address of a compressed block was computed.
* `addr_delalloc`: The address of a block could not be computed due to
  delayed allocation. Only possible when using obsolete `FIEMAP` code.
* `addr_eof_e`: The address of a block at EOF that was not block-aligned was computed.
* `addr_from_fd`: The address of a block was computed using a `fd`
  (open to the file in question) and `offset` pair.
* `addr_from_root_fd`: The address of a block was computed using
  the filesystem root `fd` instead of the open file `fd` for the
  `TREE_SEARCH_V2` ioctl. This is obsolete and should probably be removed
  at some point.
* `addr_hole`: The address of a block in a hole was computed.
* `addr_magic`: The address of a block cannot be determined in a way
  that bees can use (unrecognized flags or flags known to be incompatible
  with bees).
* `addr_uncompressed`: The address of an uncompressed block was computed.
* `addr_unrecognized`: The address of a block with unrecognized flags
  (i.e. kernel version newer than bees) was computed.
* `addr_unusable`: The address of a block with unusable flags (i.e. flags
  that are known to be incompatible with bees) was computed.
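bees packs each such address into a single 64-bit value. Since physical
block addresses are 4K-aligned, the low bits are free to carry per-block
metadata. The following is a simplified illustration of the idea, with
hypothetical constants and layout; it is not bees's actual encoding:

```cpp
#include <cstdint>

// Hypothetical illustration only, not bees's actual bit layout:
// a 4K-aligned physical address leaves the low 12 bits free for
// per-block metadata, and the topmost values can be reserved as
// "magic" addresses for blocks with no usable physical location.
constexpr uint64_t BLOCK_MASK     = 0xfffULL;
constexpr uint64_t COMPRESSED_BIT = 0x800ULL;   // cf. addr_compressed_offset
constexpr uint64_t ADDR_HOLE      = ~0ULL;      // cf. addr_hole
constexpr uint64_t ADDR_UNUSABLE  = ~0ULL - 1;  // cf. addr_unusable
constexpr uint64_t ADDR_DELALLOC  = ~0ULL - 2;  // cf. addr_delalloc

// Magic addresses live above any real 4K-aligned block address.
constexpr bool is_magic(uint64_t addr) { return addr > ~BLOCK_MASK; }

// cf. addr_uncompressed: the address is just the aligned physical position.
constexpr uint64_t make_uncompressed_addr(uint64_t physical)
{
        return physical & ~BLOCK_MASK;
}

// cf. addr_compressed_offset: pack the block's index within its
// compressed extent into the spare low bits, plus a flag bit.
constexpr uint64_t make_compressed_addr(uint64_t physical, uint64_t block_index)
{
        return (physical & ~BLOCK_MASK) | COMPRESSED_BIT | (block_index & 0x7ff);
}
```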
adjust
------
The `adjust` event group consists of operations related to translating stored virtual block addresses (i.e. physical position on disk) to `(root, inode, offset)` tuples (i.e. logical positions within files). `BeesResolver::adjust_offset` determines if a single candidate reference from the `LOGICAL_INO` ioctl corresponds to the requested btrfs virtual block address.
* `adjust_compressed_offset_correct`: A block address corresponding to a compressed block was retrieved from the hash table and resolved to a physical block containing data that matches another block bees has already read.
* `adjust_compressed_offset_wrong`: A block address corresponding to a compressed block was retrieved from the hash table and resolved to a physical block containing data that matches the hash but not the data from another block bees has already read (i.e. there was a hash collision).
* `adjust_eof_fail`: A block address corresponding to a block at EOF that was not aligned to a block boundary matched another block bees already read, but the length of the unaligned data in both blocks was not equal. This is usually caused by stale entries in the hash table pointing to blocks that have been overwritten since the hash table entries were created. It can also be caused by hash collisions, but hashes are not yet computed at this point in the code, so this event does not correlate to the `hash_collision` counter.
* `adjust_eof_haystack`: A block address from the hash table corresponding to a block at EOF that was not aligned to a block boundary was processed.
* `adjust_eof_hit`: A block address corresponding to a block at EOF that was not aligned to a block boundary matched a similarly unaligned block that bees already read.
* `adjust_eof_miss`: A block address from the hash table corresponding to a block at EOF that was not aligned to a block boundary did not match a similarly unaligned block that bees already read.
* `adjust_eof_needle`: A block address from scanning the disk corresponding to a block at EOF that was not aligned to a block boundary was processed.
* `adjust_exact`: A block address from the hash table corresponding to an uncompressed data block was processed to find its `(root, inode, offset)` references.
* `adjust_exact_correct`: A block address corresponding to an uncompressed block was retrieved from the hash table and resolved to a physical block containing data that matches another block bees has already read.
* `adjust_exact_wrong`: A block address corresponding to an uncompressed block was retrieved from the hash table and resolved to a physical block containing data that matches the hash but not the data from another block bees has already read (i.e. there was a hash collision).
* `adjust_hit`: A block address was retrieved from the hash table and resolved to a physical block in an uncompressed extent containing data that matches the data from another block bees has already read (i.e. a duplicate match was found).
* `adjust_miss`: A block address was retrieved from the hash table and resolved to a physical block containing a hash that does not match the hash from another block bees has already read (i.e. the hash table contained a stale entry and the data it referred to has since been overwritten in the filesystem).
* `adjust_needle_too_long`: A block address was retrieved from the hash table, but when the corresponding extent item was retrieved, its offset or length was out of range to be a match (i.e. the hash table contained a stale entry and the data it referred to has since been overwritten in the filesystem).
* `adjust_no_match`: A hash collision occurred (i.e. a block on disk was located with the same hash as the hash table entry but different data). Effectively an alias for `hash_collision`, as it is not possible to have one event without the other.
* `adjust_offset_high`: The `LOGICAL_INO` ioctl gave an extent item that does not overlap with the desired block because the extent item ends before the desired block in the extent data.
* `adjust_offset_hit`: A block address was retrieved from the hash table and resolved to a physical block in a compressed extent containing data that matches the data from another block bees has already read (i.e. a duplicate match was found).
* `adjust_offset_low`: The `LOGICAL_INO` ioctl gave an extent item that does not overlap with the desired block because the extent item begins after the desired block in the extent data.
* `adjust_try`: A block address and extent item candidate were passed to `BeesResolver::adjust_offset` for processing.
block
-----
The `block` event group consists of operations related to reading data blocks from the filesystem.
* `block_bytes`: Number of data bytes read.
* `block_hash`: Number of block hashes computed.
* `block_ms`: Total time reading data blocks.
* `block_read`: Number of data blocks read.
* `block_zero`: Number of data blocks read with zero contents (i.e. candidates for replacement with a hole).
bug
---
The `bug` event group consists of known bugs in bees.
* `bug_bad_max_transid`: A bad `max_transid` was found and removed in `beescrawl.dat`.
* `bug_bad_min_transid`: A bad `min_transid` was found and removed in `beescrawl.dat`.
* `bug_dedup_same_physical`: `BeesContext::dedup` detected that the physical extent was the same for `src` and `dst`. This has no effect on space usage so it is a waste of time, and also carries the risk of creating a toxic extent.
* `bug_grow_pair_overlaps`: Two identical blocks were found, and while searching matching adjacent extents, the potential `src` grew to overlap the potential `dst`. This would create a cycle where bees keeps trying to eliminate blocks but instead just moves them around.
* `bug_hash_duplicate_cell`: Two entries in the hash table were identical. This only happens due to data corruption or a bug.
* `bug_hash_magic_addr`: An entry in the hash table contains an address with magic. Magic addresses cannot be deduplicated so they should not be stored in the hash table.
chase
-----
The `chase` event group consists of operations connecting btrfs virtual block addresses with `(root, inode, offset)` tuples. `resolve` is the top level, `adjust` is the bottom level, and `chase` is the middle level. `BeesResolver::chase_extent_ref` iterates over `(root, inode, offset)` tuples from `LOGICAL_INO` and attempts to find a single matching block in the filesystem given a candidate block from an earlier `scan` operation.
* `chase_corrected`: A matching block was resolved to a `(root, inode, offset)` tuple, but the offset of a block matching data did not match the offset given by `LOGICAL_INO`.
* `chase_hit`: A block address was successfully and correctly translated to a `(root, inode, offset)` tuple.
* `chase_no_data`: A block address was not successfully translated to a `(root, inode, offset)` tuple.
* `chase_no_fd`: A `(root, inode)` tuple could not be opened (i.e. the file was deleted on the filesystem).
* `chase_try`: A block address translation attempt started.
* `chase_uncorrected`: A matching block was resolved to a `(root, inode, offset)` tuple, and the offset of a block matching data did match the offset given by `LOGICAL_INO`.
* `chase_wrong_addr`: The btrfs virtual address (i.e. physical block address) found at a candidate `(root, inode, offset)` tuple did not match the expected btrfs virtual address (i.e. the filesystem was modified during the resolve operation).
* `chase_wrong_magic`: The extent item at a candidate `(root, inode, offset)` tuple has magic bits and cannot match any btrfs virtual address in the hash table (i.e. the filesystem was modified during the resolve operation).
crawl
-----
The `crawl` event group consists of operations related to scanning btrfs trees to find new extent refs to scan for dedupe.
* `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
* `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
* `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
* `crawl_done`: One pass over a subvol was completed.
* `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
* `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
* `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
* `crawl_extent`: The extent crawler queued all references to an extent for processing.
* `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
* `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
* `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
* `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
* `crawl_hole`: An extent item in the search results refers to a hole.
* `crawl_inline`: An extent item in the search results contains an inline extent.
* `crawl_items`: An item in the `TREE_SEARCH_V2` data was processed.
* `crawl_ms`: Time spent running the `TREE_SEARCH_V2` ioctl.
* `crawl_no_empty`: Attempted to delete the last crawler. Should never happen.
* `crawl_nondata`: An item in the search results is not data.
* `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
* `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
* `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
* `crawl_skip`: Small extent items were skipped because no extent of sufficient size was found within the minimum search distance.
* `crawl_skip_ms`: Time spent skipping small extent items.
* `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
* `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
* `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
* `crawl_unknown`: An extent item in the search results has an unrecognized type.
* `crawl_unthrottled`: Extent scan allowed to create work queue items again.
dedup
-----
The `dedup` (sic) event group consists of operations that deduplicate data. A sketch of the underlying kernel ioctl follows the list.
* `dedup_bytes`: Total bytes in extent references deduplicated.
* `dedup_copy`: Total bytes copied to eliminate unique data in extents containing a mix of unique and duplicate data.
* `dedup_hit`: Total number of pairs of identical extent references.
* `dedup_miss`: Total number of pairs of non-identical extent references.
* `dedup_ms`: Total time spent running the `FILE_EXTENT_SAME` (aka `FIDEDUPERANGE` or `dedupe_file_range`) ioctl.
* `dedup_prealloc_bytes`: Total bytes in eliminated `PREALLOC` extent references.
* `dedup_prealloc_hit`: Total number of successfully eliminated `PREALLOC` extent references.
* `dedup_prealloc_miss`: Total number of unsuccessfully eliminated `PREALLOC` extent references (i.e. filesystem data changed between scan and dedupe).
* `dedup_try`: Total number of pairs of extent references submitted for deduplication.
* `dedup_workaround_btrfs_send`: Total number of extent reference pairs submitted for deduplication that were discarded to work around `btrfs send` bugs.
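For reference, a minimal sketch of one call to that ioctl, using the
public `FIDEDUPERANGE` interface from `linux/fs.h` (error handling and
bees's own wrappers omitted):

```cpp
#include <linux/fs.h>      // FIDEDUPERANGE, struct file_dedupe_range
#include <sys/ioctl.h>
#include <cstdint>
#include <cstdlib>

// Ask the kernel to dedupe `len` bytes at (src_fd, src_off) against
// (dst_fd, dst_off). Returns bytes deduped, or -1 on failure/mismatch.
int64_t dedupe_one_range(int src_fd, uint64_t src_off,
                         int dst_fd, uint64_t dst_off, uint64_t len)
{
        // One file_dedupe_range_info for our single dst range.
        size_t sz = sizeof(file_dedupe_range) + sizeof(file_dedupe_range_info);
        auto *args = static_cast<file_dedupe_range *>(calloc(1, sz));
        args->src_offset = src_off;
        args->src_length = len;
        args->dest_count = 1;
        args->info[0].dest_fd = dst_fd;
        args->info[0].dest_offset = dst_off;
        int64_t deduped = -1;                            // dedup_try
        if (ioctl(src_fd, FIDEDUPERANGE, args) == 0 &&
            args->info[0].status == FILE_DEDUPE_RANGE_SAME)
                deduped = args->info[0].bytes_deduped;   // dedup_hit
        // FILE_DEDUPE_RANGE_DIFFERS here would be a dedup_miss.
        free(args);
        return deduped;
}
```

The kernel locks and compares both ranges before sharing extents, which
is why a stale hash table entry shows up as a `dedup_miss` rather than
as data corruption.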
exception
---------
The `exception` event group consists of C++ exceptions. C++ exceptions are thrown due to IO errors and internal constraint check failures.
* `exception_caught`: Total number of C++ exceptions thrown and caught by a generic exception handler.
* `exception_caught_silent`: Total number of "silent" C++ exceptions thrown and caught by a generic exception handler. These are exceptions which are part of the correct and normal operation of bees. The exceptions are logged at a lower log level.
extent
------
The `extent` event group consists of events that occur within the extent scanner.
* `extent_deferred_inode`: A lock conflict was detected when two worker threads attempted to manipulate the same inode at the same time.
* `extent_empty`: A complete list of references to an extent was created but the list was empty, e.g. because all refs are in deleted inodes or snapshots.
* `extent_fail`: An ioctl call to `LOGICAL_INO` failed.
* `extent_forward`: An extent reference was submitted for scanning.
* `extent_mapped`: A complete map of references to an extent was created and added to the crawl queue.
* `extent_ok`: An ioctl call to `LOGICAL_INO` completed successfully.
* `extent_overflow`: A complete map of references to an extent exceeded `BEES_MAX_EXTENT_REF_COUNT`, so the extent was dropped.
* `extent_ref_missing`: An extent reference reported by `LOGICAL_INO` was not found by later `TREE_SEARCH_V2` calls.
* `extent_ref_ok`: One extent reference was queued for scanning.
* `extent_restart`: An extent reference was requeued to be scanned again after an active extent lock is released.
* `extent_retry`: An extent reference was requeued to be scanned again after an active inode lock is released.
* `extent_skip`: A 4K extent with more than 1000 refs was skipped.
* `extent_zero`: An ioctl call to `LOGICAL_INO` succeeded, but reported an empty list of extents.
hash
----
The `hash` event group consists of operations related to the bees hash table. A simplified sketch of the bucket behavior behind these counters follows the list.
* `hash_already`: A `(hash, address)` pair was already present in the hash table during a `BeesHashTable::push_random_hash_addr` operation.
* `hash_bump`: An existing `(hash, address)` pair was moved forward in the hash table by a `BeesHashTable::push_random_hash_addr` operation.
* `hash_collision`: A pair of data blocks was found with identical hashes but different data.
* `hash_erase`: A `(hash, address)` pair in the hash table was removed because a matching data block could not be found in the filesystem (i.e. the hash table entry is out of date).
* `hash_erase_miss`: A `(hash, address)` pair was reported missing from the filesystem but no such entry was found in the hash table (i.e. race between scanning threads or pair already evicted).
* `hash_evict`: A `(hash, address)` pair was evicted from the hash table to accommodate a new hash table entry.
* `hash_extent_in`: A hash table extent was read.
* `hash_extent_out`: A hash table extent was written.
* `hash_front`: A `(hash, address)` pair was pushed to the front of the list because it matched a duplicate block.
* `hash_front_already`: A `(hash, address)` pair was pushed to the front of the list because it matched a duplicate block, but the pair was already at the front of the list so no change occurred.
* `hash_insert`: A `(hash, address)` pair was inserted by `BeesHashTable::push_random_hash_addr`.
* `hash_lookup`: The hash table was searched for `(hash, address)` pairs matching a given `hash`.
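A simplified model of one hash table bucket (a sketch only; bees's real
table has a different layout and entry format):

```cpp
#include <cstddef>
#include <cstdint>

// Simplified sketch of one fixed-size hash table bucket. Entries near
// the front are recently confirmed duplicates; new entries are inserted
// at a random position so unproven hashes get a middling lifetime, and
// the entry at the back is lost when the bucket is full.
struct Entry { uint64_t hash = 0, addr = 0; };

struct Bucket {
        static constexpr size_t SIZE = 128;     // arbitrary for the sketch
        Entry entries[SIZE];
        size_t used = 0;

        // cf. hash_insert / hash_already / hash_evict
        void push_random_hash_addr(Entry e, size_t random_pos)
        {
                for (size_t i = 0; i < used; ++i)
                        if (entries[i].hash == e.hash && entries[i].addr == e.addr)
                                return;                         // hash_already
                if (used < SIZE)
                        ++used;                                 // room to grow
                // else: the entry formerly at the back is lost (hash_evict)
                size_t pos = random_pos % used;
                for (size_t i = used - 1; i > pos; --i)         // shift back
                        entries[i] = entries[i - 1];
                entries[pos] = e;                               // hash_insert
        }

        // cf. hash_front / hash_front_already: a confirmed duplicate is
        // moved to the front so it survives eviction the longest.
        void push_front(size_t i)
        {
                if (i == 0)
                        return;                                 // hash_front_already
                Entry e = entries[i];
                for (size_t j = i; j > 0; --j)
                        entries[j] = entries[j - 1];
                entries[0] = e;                                 // hash_front
        }
};
```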
open
----
The `open` event group consists of operations related to translating `(root, inode)` tuples into open file descriptors (i.e. `open_by_handle` emulation for btrfs). A sketch of the underlying path lookup follows the list.
* `open_clear`: The open FD cache was cleared to avoid keeping file descriptors open too long.
* `open_fail_enoent`: A file could not be opened because it no longer exists (i.e. it was deleted or renamed during the lookup/resolve operations).
* `open_fail_error`: A file could not be opened for other reasons (e.g. IO error, permission denied, out of resources).
* `open_file`: A file was successfully opened. This counts only the `open()` system call, not other reasons why the opened FD might not be usable.
* `open_hit`: A file was successfully opened and the FD was acceptable.
* `open_ino_ms`: Total time spent executing the `open()` system call.
* `open_lookup_empty`: No paths were found for the inode in the `INO_PATHS` ioctl.
* `open_lookup_enoent`: The `INO_PATHS` ioctl returned ENOENT.
* `open_lookup_error`: The `INO_PATHS` ioctl returned a different error.
* `open_lookup_ok`: The `INO_PATHS` ioctl successfully returned a list of one or more filenames.
* `open_no_path`: All attempts to open a file by `(root, inode)` pair failed.
* `open_no_root`: An attempt to open a file by `(root, inode)` pair failed because the `root` could not be opened.
* `open_root_ms`: Total time spent opening subvol root FDs.
* `open_wrong_dev`: A FD returned by `open()` did not match the device belonging to the filesystem subvol.
* `open_wrong_flags`: A FD returned by `open()` had incompatible flags (`NODATASUM` / `NODATACOW`).
* `open_wrong_ino`: A FD returned by `open()` did not match the expected inode (i.e. the file was renamed or replaced during the lookup/resolve operations).
* `open_wrong_root`: A FD returned by `open()` did not match the expected subvol ID (i.e. `root`).
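A minimal sketch of the lookup step, using `BTRFS_IOC_INO_PATHS` from
`linux/btrfs.h` (bees's real code adds caching and the verification
steps counted above):

```cpp
#include <linux/btrfs.h>   // BTRFS_IOC_INO_PATHS, btrfs_ioctl_ino_path_args
#include <sys/ioctl.h>
#include <cstdint>
#include <string>
#include <vector>

// List paths to an inode, relative to the subvol open at root_fd.
// bees then open()s a path and checks dev/ino/root/flags on the result
// (open_wrong_dev, open_wrong_ino, open_wrong_root, open_wrong_flags).
std::vector<std::string> ino_paths(int root_fd, uint64_t inum)
{
        std::vector<uint64_t> buf(8192);        // 64 KiB result container
        auto *fspath = reinterpret_cast<btrfs_data_container *>(buf.data());
        btrfs_ioctl_ino_path_args args = {};
        args.inum = inum;
        args.size = buf.size() * sizeof(uint64_t);
        args.fspath = reinterpret_cast<uintptr_t>(buf.data());
        std::vector<std::string> paths;
        if (ioctl(root_fd, BTRFS_IOC_INO_PATHS, &args) == 0)    // open_lookup_ok
                for (uint32_t i = 0; i < fspath->elem_cnt; ++i)
                        // val[i] is a byte offset from val[] to a NUL-terminated path
                        paths.emplace_back(reinterpret_cast<const char *>(fspath->val)
                                           + fspath->val[i]);
        return paths;   // empty: open_lookup_empty, _enoent, or _error
}
```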
pairbackward
------------
The `pairbackward` event group consists of events related to extending matching block ranges backward starting from the initial block match found using the hash table.
* `pairbackward_bof_first`: A matching pair of block ranges could not be extended backward because the beginning of the first (src) file was reached.
* `pairbackward_bof_second`: A matching pair of block ranges could not be extended backward because the beginning of the second (dst) file was reached.
* `pairbackward_hit`: A pair of matching block ranges was extended backward by one block.
* `pairbackward_miss`: A pair of matching block ranges could not be extended backward by one block because the pair of blocks before the first block in the range did not contain identical data.
* `pairbackward_ms`: Total time spent extending matching block ranges backward from the first matching block found by hash table lookup.
* `pairbackward_overlap`: A pair of matching block ranges could not be extended backward by one block because this would cause the two block ranges to overlap.
* `pairbackward_same`: A pair of matching block ranges could not be extended backward by one block because this would cause the two block ranges to refer to the same btrfs data extent.
* `pairbackward_stop`: Stopped extending a pair of matching block ranges backward for any of the reasons listed here.
* `pairbackward_toxic_addr`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic address.
* `pairbackward_toxic_hash`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic hash.
* `pairbackward_try`: Started extending a pair of matching block ranges backward.
* `pairbackward_zero`: A pair of matching block ranges could not be extended backward by one block because the src block contained all zeros and was not compressed.
pairforward
-----------
The `pairforward` event group consists of events related to extending matching block ranges forward starting from the initial block match found using the hash table.
* `pairforward_eof_first`: A matching pair of block ranges could not be extended forward because the end of the first (src) file was reached.
* `pairforward_eof_malign`: A matching pair of block ranges could not be extended forward because the end of the second (dst) file was not aligned to a 4K boundary nor the end of the first (src) file.
* `pairforward_eof_second`: A matching pair of block ranges could not be extended forward because the end of the second (dst) file was reached.
* `pairforward_hit`: A pair of matching block ranges was extended forward by one block.
* `pairforward_hole`: A pair of matching block ranges was extended forward by one block, and the block was a hole in the second (dst) file.
* `pairforward_miss`: A pair of matching block ranges could not be extended forward by one block because the pair of blocks after the last block in the range did not contain identical data.
* `pairforward_ms`: Total time spent extending matching block ranges forward from the first matching block found by hash table lookup.
* `pairforward_overlap`: A pair of matching block ranges could not be extended forward by one block because this would cause the two block ranges to overlap.
* `pairforward_same`: A pair of matching block ranges could not be extended forward by one block because this would cause the two block ranges to refer to the same btrfs data extent.
* `pairforward_stop`: Stopped extending a pair of matching block ranges forward for any of the reasons listed here.
* `pairforward_toxic_addr`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic address.
* `pairforward_toxic_hash`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic hash.
* `pairforward_try`: Started extending a pair of matching block ranges forward.
* `pairforward_zero`: A pair of matching block ranges could not be extended forward by one block because the src block contained all zeros and was not compressed.
progress
--------
The `progress` event group consists of events related to progress estimation.
* `progress_no_data_bg`: Failed to retrieve any data block groups from the filesystem.
* `progress_not_created`: A crawler for one size tier had not been created for the extent scanner.
* `progress_complete`: A crawler for one size tier has completed a scan.
* `progress_not_found`: The extent position for a crawler does not correspond to any block group.
* `progress_out_of_bg`: The extent position for a crawler does not correspond to any data block group.
* `progress_ok`: Table of progress and ETA created successfully.
readahead
---------
The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).
* `readahead_bytes`: Number of bytes prefetched.
* `readahead_count`: Number of read calls.
* `readahead_clear`: Number of times the duplicate read cache was cleared.
* `readahead_fail`: Number of read errors during prefetch.
* `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
* `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
* `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.
replacedst
----------
The `replacedst` event group consists of events related to replacing a single reference to a dst extent using any suitable src extent (i.e. eliminating a single duplicate extent ref during a crawl).
* `replacedst_dedup_hit`: A duplicate extent reference was identified and removed.
* `replacedst_dedup_miss`: A duplicate extent reference was identified, but src and dst extents did not match (i.e. the filesystem changed in the meantime).
* `replacedst_grown`: A duplicate block was identified, and adjacent blocks were duplicate as well.
* `replacedst_overlaps`: A pair of duplicate block ranges was identified, but the pair was not usable for dedupe because the two ranges overlap.
* `replacedst_same`: A pair of duplicate block ranges was identified, but the pair was not usable for dedupe because the physical block ranges were the same.
* `replacedst_try`: A duplicate block was identified and an attempt was made to remove it (i.e. this is the total number of replacedst calls).
replacesrc
----------
The `replacesrc` event group consists of events related to replacing every reference to a src extent using a temporary copy of the extent's data (i.e. eliminating leftover unique data in a partially duplicate extent during a crawl).
* `replacesrc_dedup_hit`: A duplicate extent reference was identified and removed.
* `replacesrc_dedup_miss`: A duplicate extent reference was identified, but src and dst extents did not match (i.e. the filesystem changed in the meantime).
* `replacesrc_grown`: A duplicate block was identified, and adjacent blocks were duplicate as well.
* `replacesrc_overlaps`: A pair of duplicate block ranges was identified, but the pair was not usable for dedupe because the two ranges overlap.
* `replacesrc_try`: A duplicate block was identified and an attempt was made to remove it (i.e. this is the total number of replacesrc calls).
resolve
-------
The `resolve` event group consists of operations related to translating a btrfs virtual block address (i.e. physical block address) to a `(root, inode, offset)` tuple (i.e. locating and opening the file containing a matching block). `resolve` is the top level, `chase` and `adjust` are the lower two levels. A sketch of the underlying ioctl call follows the list.
* `resolve_empty`: The `LOGICAL_INO` ioctl returned successfully with an empty reference list (0 items).
* `resolve_fail`: The `LOGICAL_INO` ioctl returned an error.
* `resolve_large`: The `LOGICAL_INO` ioctl returned more than 2730 results (the limit of the v1 ioctl).
* `resolve_ms`: Total time spent in the `LOGICAL_INO` ioctl (i.e. wallclock time, not kernel CPU time).
* `resolve_ok`: The `LOGICAL_INO` ioctl returned success.
* `resolve_overflow`: The `LOGICAL_INO` ioctl returned 9999 or more extents (the limit configured in `bees.h`).
* `resolve_toxic`: The `LOGICAL_INO` ioctl took more than 0.1 seconds of kernel CPU time.
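A minimal sketch of one `LOGICAL_INO` call with a toxic-extent timing
check, using `BTRFS_IOC_LOGICAL_INO` from `linux/btrfs.h`. Note the
simplifications: bees charges kernel CPU time rather than the wallclock
time measured here, and uses the larger-buffer V2 ioctl where available.

```cpp
#include <linux/btrfs.h>   // BTRFS_IOC_LOGICAL_INO, btrfs_ioctl_logical_ino_args
#include <sys/ioctl.h>
#include <chrono>
#include <cstdint>
#include <vector>

// Resolve a btrfs virtual address into (inum, offset, root) triples,
// and flag the call as toxic if it runs too long.
bool logical_ino(int fs_fd, uint64_t logical, bool &toxic)
{
        std::vector<uint64_t> buf(65536 / sizeof(uint64_t));   // v1 size limit
        auto *inodes = reinterpret_cast<btrfs_data_container *>(buf.data());
        btrfs_ioctl_logical_ino_args args = {};
        args.logical = logical;
        args.size = buf.size() * sizeof(uint64_t);
        args.inodes = reinterpret_cast<uintptr_t>(buf.data());

        auto t0 = std::chrono::steady_clock::now();
        int rv = ioctl(fs_fd, BTRFS_IOC_LOGICAL_INO, &args);   // resolve_ok / resolve_fail
        toxic = std::chrono::steady_clock::now() - t0
                > std::chrono::milliseconds(100);              // resolve_toxic
        if (rv != 0)
                return false;
        // val[] holds elem_cnt u64s as (inum, offset, root) triples;
        // elem_cnt == 0 is a resolve_empty event.
        for (uint32_t i = 0; i + 2 < inodes->elem_cnt; i += 3) {
                uint64_t inum = inodes->val[i];
                uint64_t offset = inodes->val[i + 1];
                uint64_t root = inodes->val[i + 2];
                (void)inum; (void)offset; (void)root;  // chase each ref here
        }
        return true;
}
```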
root
----
The `root` event group consists of operations related to translating a btrfs root ID (i.e. subvol ID) into an open file descriptor by navigating the btrfs root tree.
* `root_clear`: The root FD cache was cleared.
* `root_found`: A root FD was successfully opened.
* `root_notfound`: A root FD could not be opened because all candidate paths could not be opened, or there were no paths available.
* `root_ok`: A root FD was opened and its correctness verified.
* `root_open_fail`: A root FD `open()` attempt returned an error.
* `root_parent_open_fail`: A recursive call to open the parent of a subvol failed.
* `root_parent_open_ok`: A recursive call to open the parent of a subvol succeeded.
* `root_parent_open_try`: A recursive call to open the parent of a subvol was attempted.
* `root_parent_path_empty`: No path could be found to connect a parent root FD to its child.
* `root_parent_path_fail`: The `INO_PATHS` ioctl failed to find a name for a child subvol relative to its parent.
* `root_parent_path_open_fail`: The `open()` call in a recursive call to open the parent of a subvol returned an error.
* `root_workaround_btrfs_send`: A subvol was determined to be read-only and disabled to implement the btrfs send workaround.
scan
----
The `scan` event group consists of operations related to scanning incoming data. This is where bees finds duplicate data and populates the hash table.
* `scan_blacklisted`: A blacklisted extent was passed to `scan_forward` and dropped.
* `scan_block`: A block of data was scanned.
* `scan_compressed_no_dedup`: An extent that was compressed contained non-zero, non-duplicate data.
* `scan_dup_block`: Number of duplicate block references deduped.
* `scan_dup_hit`: A pair of duplicate block ranges was found.
* `scan_dup_miss`: A pair of duplicate blocks was found in the hash table but not in the filesystem.
* `scan_extent`: An extent was scanned (`scan_one_extent`).
* `scan_forward`: A logical byte range was scanned (`scan_forward`).
* `scan_found`: An entry was found in the hash table matching a scanned block from the filesystem.
* `scan_hash_hit`: A block was found on the filesystem corresponding to a block found in the hash table.
* `scan_hash_miss`: A block was not found on the filesystem corresponding to a block found in the hash table.
* `scan_hash_preinsert`: A non-zero data block's hash was prepared for possible insertion into the hash table.
* `scan_hash_insert`: A non-zero data block's hash was inserted into the hash table.
* `scan_hole`: A hole extent was found during scan and ignored.
* `scan_interesting`: An extent had flags that were not recognized by bees and was ignored.
* `scan_lookup`: A hash was looked up in the hash table.
* `scan_malign`: A block being scanned matched a hash at EOF in the hash table, but the EOF was not aligned to a block boundary and the two blocks did not have the same length.
* `scan_push_front`: An entry in the hash table matched a duplicate block, so the entry was moved to the head of its LRU list.
* `scan_reinsert`: A copied block's hash and block address was inserted into the hash table.
* `scan_resolve_hit`: A block address in the hash table was successfully resolved to an open FD and offset pair.
* `scan_resolve_zero`: A block address in the hash table was not resolved to any subvol/inode pair, so the corresponding hash table entry was removed.
* `scan_rewrite`: A range of bytes in a file was copied, then the copy deduped over the original data.
* `scan_root_dead`: A deleted subvol was detected.
* `scan_seen_clear`: The list of recently scanned extents reached maximum size and was cleared.
* `scan_seen_erase`: An extent reference was modified by scan, so all future references to the extent must be scanned.
* `scan_seen_hit`: A scan was skipped because the same extent had recently been scanned.
* `scan_seen_insert`: An extent reference was not modified by scan and its hashes have been inserted into the hash table, so all future references to the extent can be ignored.
* `scan_seen_miss`: A scan was not skipped because the same extent had not recently been scanned (i.e. the extent was scanned normally).
* `scan_skip_bytes`: Nuisance dedupe or hole-punching would save less than half of the data in an extent.
* `scan_skip_ops`: Nuisance dedupe or hole-punching would require too many dedupe/copy/hole-punch operations in an extent.
* `scan_toxic_hash`: A scanned block has the same hash as a hash table entry that is marked toxic.
* `scan_toxic_match`: A hash table entry points to a block that is discovered to be toxic.
* `scan_twice`: Two references to the same block have been found in the hash table.
* `scan_zero`: A data block containing only zero bytes was detected.
scanf
-----
The `scanf` event group consists of operations related to `BeesContext::scan_forward`. This is the entry point where `crawl` schedules new data for scanning.
* `scanf_deferred_extent`: Two tasks attempted to scan the same extent at the same time, so one was deferred.
* `scanf_eof`: Scan past EOF was attempted.
* `scanf_extent`: A btrfs extent item was scanned.
* `scanf_extent_ms`: Total thread-seconds spent scanning btrfs extent items.
* `scanf_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
* `scanf_total`: A logical byte range of a file was scanned.
* `scanf_total_ms`: Total thread-seconds spent scanning logical byte ranges.
Note that in current versions of bees, `scan_forward` is passed extents
that correspond exactly to btrfs extent items, so the `scanf_extent` and
`scanf_total` numbers can only be different if the filesystem changes
between crawl time and scan time.
sync
----
The `sync` event group consists of operations related to the `fsync` workarounds in bees.
* `sync_count`: `fsync()` was called on a temporary file.
* `sync_ms`: Total time spent executing `fsync()`.
tmp
---
The `tmp` event group consists of operations related to temporary files and the data within them. A sketch of the temporary file lifecycle follows the list.
* `tmp_aligned`: A temporary extent was allocated on a block boundary.
* `tmp_block`: Total number of temporary blocks copied.
* `tmp_block_zero`: Total number of temporary hole blocks copied.
* `tmp_bytes`: Total number of temporary bytes copied.
* `tmp_copy`: Total number of extents copied.
* `tmp_copy_ms`: Total time spent copying extents.
* `tmp_create`: Total number of temporary files created.
* `tmp_create_ms`: Total time spent creating temporary files.
* `tmp_hole`: Total number of hole extents created.
* `tmp_realign`: A temporary extent was not aligned to a block boundary.
* `tmp_resize`: A temporary file was resized with `ftruncate()`.
* `tmp_resize_ms`: Total time spent in `ftruncate()`.
* `tmp_trunc`: The temporary file size limit was exceeded, triggering creation of a new temporary file.
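A minimal sketch of that lifecycle, assuming `O_TMPFILE` support (bees's
real `TempFile` class also manages btrfs attribute flags and reuses
files across operations):

```cpp
#include <fcntl.h>
#include <unistd.h>

// Create an unnamed temporary file in dir_path and size it with
// ftruncate() (cf. tmp_create, tmp_resize). The file is deleted
// automatically when the last FD referring to it is closed.
int make_temp_file(const char *dir_path, off_t size)
{
        int fd = open(dir_path, O_TMPFILE | O_RDWR, 0600);   // tmp_create
        if (fd < 0)
                return -1;
        if (ftruncate(fd, size) != 0) {                      // tmp_resize
                close(fd);
                return -1;
        }
        return fd;
}
```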