The deadlock seems to be fixed now (if there ever was one--there certainly
were deadlocks, but matching deadlocks to root causes is non-trivial
and a number of distinct deadlock cases have been fixed in recent years).
The benchmark data is inconclusive about whether it is better to fsync or
not to fsync. A paranoia option might be useful here.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The crawl_master task had a simple atomic variable that was supposed
to prevent duplicate crawl_master tasks from ending up in the queue;
however, this had a race condition that could lead to m_task_running
being set with no crawl_master task running to clear it. This would in
turn prevent crawl_thread from scheduling any further crawl_master tasks,
and bees would eventually stop doing any more work.
A proper fix is to modify the Task class and its friends such that
Task::run() guarantees that 1) at most one instance of a Task is ever
scheduled or running at any time, and 2) if a Task is scheduled while
an instance of the Task is running, the scheduling is deferred until
after the current instance completes. This is part of a fairly large
planned change set, but it's not ready to push now.
So instead, unconditionally push a new crawl_master Task into the queue
on every poll, then silently and quickly exit if the queue is too full
or the supply of new extents is empty. Drop the scheduling-related
members of BeesRoots as they will not be needed when the proper fix lands.
Fixes: 4f0bc78a "crawl: don't block a Task waiting for new transids"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If /bin/sh is bash, the 'type' builtin produces a list of filenames
in $PATH that match the arguments.
If /bin/sh is dash, we get errors like:
    /bin/sh: 1: P:: not found
Hopefully having a build-dep on bash is not controversial.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The two files are identical except README.md links to docs/* while
index.md links to *.
A sed script can do that transformation, so use sed to do it.
This does modify a file in git, but this is necessary to make all
the Github views work consistently.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This sequence of log messages isn't clear:
    crawl_master: WORKAROUND: Avoiding RO subvol 6094
    crawl_master: WORKAROUND: RO root 6094
The first is from a cache miss, and appears wherever a root is opened
(dedupe or crawl). The second is skipping an entire subvol scan, and
only happens in crawl_master.
Elaborate on the second message a little.
Also use the term "root" consistently when referring to subvol tree IDs.
btrfs refers to these objects by (at least) three distinct names: tree,
subvol, and root. Using three different words for the same thing is worse
than using a single wrong word consistently to refer to the same concept.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
After weeks of testing I copied part of a change to main without copying
the rest of the change, leading to an immediate segfault on startup.
So here is the rest of the change: limit the number of
BeesContexts per process to 1. This change was discussed at
https://github.com/Zygo/bees/issues/54#issuecomment-360332529 but there
are more reasons to do it now: the candidates to replace the current
hash table format are less forgiving of sharing hash tables, and it may
even become necessary to have more than one hash table per BeesContext
instance (e.g. to keep datasum and nodatasum data separate).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
https://github.com/Zygo/bees/issues/91 describes problems encountered
when running bees on systems with many CPU cores.
Limit the computed number of threads (using --thread-factor or the
default) to a maximum of 8 (i.e. the number of logical cores in a modern
laptop). Users can override the limit by using --thread-count.
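
The limit described above can be sketched roughly as follows. This is a minimal illustration, not the bees implementation: `pick_thread_count` and its parameters are hypothetical names, and rounding the computed value up is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (not taken from bees).
// thread_count > 0 models an explicit --thread-count and is used as-is.
// Otherwise thread_factor * cores is computed and clamped to [1, max_computed].
unsigned pick_thread_count(unsigned thread_count, double thread_factor,
                           unsigned cores, unsigned max_computed = 8)
{
    assert(thread_factor > 0);
    if (thread_count > 0) {
        return thread_count;  // explicit override bypasses the limit
    }
    // Assumption: round the computed value up to a whole thread.
    const double computed = std::ceil(thread_factor * cores);
    const double limit = static_cast<double>(max_computed);
    return static_cast<unsigned>(std::min(std::max(computed, 1.0), limit));
}
```

With these assumptions, a 64-core machine with a factor of 1.0 gets 8 threads, while an explicit count of 32 is honored unchanged.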
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
options.md was a disorganized mess that markdown couldn't parse properly.
Break the options list down into sections by theme. Add the new
'--workaround-btrfs-send' option to the new 'Workarounds' section.
Clean up the rest of the text and fix some inconsistencies.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Introduce --workaround options which trade performance or effectiveness to
avoid triggering kernel bugs.
The first such option is --workaround-btrfs-send, which avoids making any
modification to read-only subvols to avoid btrfs send bugs.
Clean up usage message: no tabs for formatting, split options into
sections by theme.
Make scan mode a non-static data member like all (most?) other options.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We didn't take enough care to fix all invocations of git in this
scenario.
Fixes: 32d2739 ("Makefile: Specify version when building from tarball")
Signed-off-by: Kai Krakow <kai@kaishome.de>
The log message is quite CPU-intensive to generate, and some data sets
have enough hash collisions to throw off benchmarks.
Keep the event counter but drop the log message.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Make sure the result set is empty before running the ioctl in case
something tries to consume the result without checking the error status.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
If we are not zero-filling containers then the overhead of allocating them
on each use is negligible. The effect that the thread_local containers
were having on RAM usage was very non-negligible.
Use dynamic containers (members or stack objects) for better control
of object lifetimes and much lower peak RAM usage. They're a tiny bit
faster, too.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit brings back -O3 but in an overridable way. This should make
downstream distributions happy enough to accept it.
While on the subject, let's apply the same fixup logic to LDFLAGS, too.
This commit also properly gets rid of the implicit rules which collided
too easily with the depends.mk.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Automatically fall back to LOGICAL_INO if LOGICAL_INO_V2 fails and no
_V2 flags are used.
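
A rough sketch of the fallback rule, assuming the two ioctl calls are wrapped in callables that return true on success. `logical_ino_with_fallback`, `try_v2`, and `try_v1` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical wrapper: try_v2(flags) and try_v1() stand in for the
// LOGICAL_INO_V2 and LOGICAL_INO ioctl calls and return true on success.
bool logical_ino_with_fallback(uint64_t flags,
                               const std::function<bool(uint64_t)> &try_v2,
                               const std::function<bool()> &try_v1)
{
    assert(try_v2 && try_v1);
    if (try_v2(flags)) {
        return true;
    }
    // The old ioctl cannot honor V2-only flags, so fall back only
    // when the caller did not request any.
    if (flags != 0) {
        return false;
    }
    return try_v1();
}
```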
Add methods to set the flags argument with build portability to older
headers.
Use thread_local storage for the somewhat large buffers used by
LOGICAL_INO_V2 (and other users of BtrfsDataContainer like INO_PATHS).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Better toxic extent detection means we can now handle extents with
many more references--easily hundreds of thousands.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Replace CPU shares and IO block weight with CPU weight and IO weight.
Note that the new parameters are roughly 1/100 of the old ones--I believe
that's the right conversion. Also removed the duplicate Nice parameter
and alphabetized the parameters for ease of reading.
Faster and more reliable toxic extent detection means we can now be much
less paranoid about creating toxic extents.
The paranoia has significant impact on dedupe hit rates because every
extent that contains even one toxic hash is abandoned. The preloaded
toxic hashes were chosen because they occur more frequently than any
other block contents in typical filesystem data. The combination of these
resulted in as much as 30% of duplicate extents being left untouched.
Remove the preloaded toxic extent blacklist, and rely on the new
kernel-CPU-usage-based workaround instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Leave AL16M defined in beesd to avoid breaking scripts based on
beesd.conf.sample which used this constant.
Use the absolute size in beesd.conf.sample to avoid any future problems.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We detect toxic extents by measuring how long the LOGICAL_INO ioctl takes
to run. If it is above some threshold, we consider the extent toxic,
and blacklist it; otherwise, we process the extent normally.
The detector was using the execution time of the ioctl, which detects
toxic extents, but it also detects pauses of the bees process and
transaction commit latency due to load. This leads to a significant
number of false positives. The detection threshold was also very long,
burning a lot of kernel CPU before the detection was triggered.
Use the per-thread system CPU statistics to measure the kernel CPU usage
of the LOGICAL_INO call directly. This is much more reliable because it
is not confounded by other threads, and it's faster because we can set
the time threshold two orders of magnitude lower.
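
One way to take such a measurement on Linux is getrusage() with RUSAGE_THREAD, sampled before and after the call. The sketch below is illustrative only: the helper names and the threshold parameter are hypothetical, and it assumes a toolchain that exposes RUSAGE_THREAD (g++ defines _GNU_SOURCE by default).

```cpp
#include <sys/resource.h>
#include <sys/time.h>
#include <cassert>
#include <functional>

// Kernel (system) CPU seconds consumed so far by the calling thread.
// RUSAGE_THREAD is Linux-specific.
double thread_system_cpu_seconds()
{
    struct rusage ru {};
    const int rv = getrusage(RUSAGE_THREAD, &ru);
    assert(rv == 0);
    (void)rv;
    return ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

// Run `call` and report whether it used more kernel CPU time than
// `threshold_seconds`. Time spent sleeping or waiting is not counted,
// and neither is CPU time used by other threads.
bool exceeds_kernel_cpu(const std::function<void()> &call,
                        double threshold_seconds)
{
    assert(call);
    const double before = thread_system_cpu_seconds();
    call();
    const double after = thread_system_cpu_seconds();
    return (after - before) > threshold_seconds;
}
```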
Also remove the lock and mutex added in "context: serialize LOGICAL_INO
calls" because we theoretically no longer need it (but leave the code
there with #if 0 in case we do need it in practice).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
ROOT_TREE contains the ROOT_ITEM for EXTENT_TREE. Every modification
(that we care about) to a btrfs filesystem must go through EXTENT_TREE,
and must modify the page in ROOT_TREE pointing to the root of
EXTENT_TREE... which makes that a very good source for the filesystem
transid.
Remove the loop and the root lookups, and just look at one item for
max_transid.
Also note that every caller of transid_max_nocache() immediately
feeds the return value to m_transid_re.update(), so don't do that
inside transid_max_nocache().
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It turns out that we do need to scan all the subvols in order
to find transid_max.
Keep the bug fix though.
This reverts commit bf6ae80eeec6afcbee505d22af8e62f60dc1c9a6.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesRoots::transid_max_nocache calls btrfs_get_root_transid() which
retrieves the transid of the root of the given Fd. Since the FS_TREE
(subvol 5) is the root of the subvol hierarchy, it will always have
the highest transid on the filesystem, and we do not need to look at
any others.
Also fix a bug where we pass BTRFS_FS_TREE_OBJECTID instead of the
file descriptor root_fd() to btrfs_get_root_transid(). If BEESHOME
is somewhere on the same btrfs filesystem, and there are no leaked FDs
at bees startup, then BTRFS_FS_TREE_OBJECTID (5) usually has the same
integer value as a valid file descriptor of some object on the filesystem
that has a regularly increasing transid value. If Fd 5 happens to be a
file in BEESHOME then bees itself drives the transid increments. This,
combined with the search of all subvol roots, hides the bug (unless Fd
5 gets closed somehow).
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BeesContext::home_fd() is supposed to open $BEESHOME once and cache
the Fd for later calls; however, instead it was reopening a new Fd each
time it was called, and _also_ holding that Fd in a BeesContext member.
Fds clean themselves up when they are forgotten, so it was not leaking
per se, but it certainly had more open Fds than it needed to.
Check to see if we have m_home_fd open, and return that if so.
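
The open-once-and-cache pattern looks roughly like this. It is a simplified sketch using a raw file descriptor; `HomeDir` is a hypothetical class, not the bees Fd wrapper, and it is not thread-safe.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the object that owns the home directory Fd.
class HomeDir {
public:
    explicit HomeDir(std::string path) : m_path(std::move(path))
    {
        assert(!m_path.empty());
    }
    ~HomeDir()
    {
        if (m_fd >= 0) {
            close(m_fd);
        }
    }
    HomeDir(const HomeDir &) = delete;
    HomeDir &operator=(const HomeDir &) = delete;

    // Open the directory on first use; later calls return the cached
    // descriptor instead of opening a new one. Returns -1 if open fails.
    int fd()
    {
        if (m_fd < 0) {
            m_fd = open(m_path.c_str(), O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        }
        return m_fd;
    }

private:
    std::string m_path;
    int m_fd = -1;
};
```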
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
LOGICAL_INO can trip over the btrfs slow-backrefs bug, resulting in
some very long in-kernel runtimes. If too many threads are executing
LOGICAL_INO then there may be no cores left on the system to run other
tasks.
Toxic extent detection is done by a very rudimentary algorithm which
can be confused by unrelated sources of latency within btrfs (especially
commit latency). The algorithm can also be confused by other threads
executing the LOGICAL_INO ioctl.
These are two good reasons to prevent any two threads in a single bees
process instance from executing LOGICAL_INO at the same time, so let's
do that.
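
In outline, the serialization is a single process-wide mutex held for the duration of the ioctl. The sketch below uses hypothetical names and a callable in place of the real ioctl wrapper.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// One lock shared by every thread in the process (hypothetical name).
std::mutex &logical_ino_mutex()
{
    static std::mutex m;
    return m;
}

// Run `do_ioctl` while holding the lock, so that at most one thread
// executes it at any time. Other callers block until the lock is free.
int serialized_logical_ino(const std::function<int()> &do_ioctl)
{
    assert(do_ioctl);
    std::lock_guard<std::mutex> lock(logical_ino_mutex());
    return do_ioctl();
}
```

The lock is released when the guard goes out of scope, including when the callable throws.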
It is possible to limit the number of threads executing LOGICAL_INO with
the -c and -C options; however, this also limits the number of threads
which can perform any operation, while only LOGICAL_INO (*) has such a
profound effect on the rest of system operation.
Also make the status message clearer about exactly when LOGICAL_INO is
executed, as opposed to merely waiting to acquire a lock before executing
the ioctl.
(*) or maybe FILE_EXTENT_SAME. The problem function that keeps showing
up in kernel stack traces is find_parent_nodes, which is called by both
the LOGICAL_INO and FILE_EXTENT_SAME ioctls. We'll try this change
first and see if it prevents any recurrences of forced watchdog reboots;
if it does not, then we'll limit FILE_EXTENT_SAME the same way.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The ordering function for BeesCrawlState did not consider
    root 292 inode 0 min_transid 2345 max_transid 3456
to be larger than
    root 292 inode 258 min_transid 2345 max_transid 2345
so when we attempted to update the end pointer for the crawl progress,
the new state was not considered newer than the old state because the
min_transid was equal, but the new crawl state's inode number was smaller.
Normally this is not a problem because subvol scans typically begin
and end in separate transactions (in part because we don't start a
subvol scan until at least two transactions are available); however,
the cleanup code for the aftermath of the recent transid_min() bug can
create crawlers with equal max_transid and min_transid records.
Fix this by ordering both transid fields before any others in the
crawl state.
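
A comparison with both transid fields first can be sketched with std::tie. The struct below is a hypothetical stand-in with only the fields mentioned in this message, and the order of the remaining fields after the transids is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for the crawl state; field names are illustrative.
struct CrawlState {
    uint64_t root;
    uint64_t inode;
    uint64_t min_transid;
    uint64_t max_transid;
};

// Compare both transid fields first, then the position within the subvol.
bool operator<(const CrawlState &a, const CrawlState &b)
{
    return std::tie(a.min_transid, a.max_transid, a.root, a.inode)
         < std::tie(b.min_transid, b.max_transid, b.root, b.inode);
}

// The case from this message: the state with the larger max_transid
// must sort after the other one even though its inode number is smaller.
void check_commit_message_example()
{
    const CrawlState old_state{292, 258, 2345, 2345};
    const CrawlState new_state{292, 0, 2345, 3456};
    assert(old_state < new_state);
    assert(!(new_state < old_state));
}
```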
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Due to an earlier bug some beescrawl.dat files will contain uint64_t
max as max_transid. This prevents any further scanning on the subvol
because there is no possibility of having a real transid (or any other
uint64_t number) larger than uint64_t max.
If we detect a bad transid in beescrawl.dat, log a warning, then use
some more plausible value: either min_transid to repeat the previous
incremental crawl, or 0 to restart the subvol scan from the beginning.
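
The repair step might look something like the sketch below. The function name is hypothetical, and the rule for choosing between the two fallbacks (use min_transid unless it is also uint64_t max) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <limits>

// Hypothetical helper: return a usable max_transid for a loaded crawl state.
uint64_t repair_max_transid(uint64_t min_transid, uint64_t max_transid)
{
    const uint64_t bad = std::numeric_limits<uint64_t>::max();
    if (max_transid != bad) {
        return max_transid;  // nothing to repair
    }
    // Assumption: prefer min_transid (repeat the previous incremental
    // crawl); if that is also bad, use 0 (restart from the beginning).
    const uint64_t repaired = (min_transid != bad) ? min_transid : 0;
    std::cerr << "WARNING: bad max_transid " << max_transid
              << ", using " << repaired << " instead\n";
    assert(repaired != bad);
    return repaired;
}
```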
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
On a few test machines max_transid on subvols is getting set to
18446744073709551615 (aka uint64_t max).
Prevent transid_min() from ever returning this value.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
"saved" is used only during hash table correctness analysis, which is
normally not enabled at compile time, and requires source modification
to enable.
Remove the pointless copy and save a tiny bit of CPU.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The 16MB hash table extent size did not serve any useful defragmentation
or compression purpose, and for very small filesystems (under 100GB),
16MB is much larger than necessary.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
systemd-coredumpctl collects core files for later analysis
with gdb. It's a convenient thing if the keys you use to encrypt
/var/lib/systemd/coredump are the same as the keys you use to encrypt
the filesystem where you're running bees.
Add it to the documentation just before the hand-rolled version.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Standard crash backtrace collection, plus $BEESSTATUS for the high-level
overview of what bees is doing.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Split the rather large README into smaller sections with a pitch and
a ToC at the top.
Move the sections into docs/ so that Github Pages can read them.
'make doc' produces a local HTML tree.
Update the kernel bugs and gotchas list.
Add some information that has been accumulating in Github comments.
Remove information about bugs in kernels earlier than 4.14.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When package maintainers build from a tarball, the .git directory does
not exist to extract the version tag. Let's add a hack to work around
this issue and let them specify `BEES_VERSION="v0.y"` on the make
cmdline.
Github-Bug: https://github.com/Zygo/bees/issues/75
Signed-off-by: Kai Krakow <kai@kaishome.de>
Gentoo has officially merged the ebuild into portage as of:
https://github.com/gentoo/gentoo/pull/9925
Let's update the readme and get rid of the `contrib/gentoo-bees`
directory, so we have no potentially outdated information in the future.
Signed-off-by: Kai Krakow <kai@kaishome.de>
Now that the packaging preparations were merged, we should update the
ebuild to reflect the upstream master branch.
Signed-off-by: Kai Krakow <kai@kaishome.de>