Fix the missing symbols that popped up when adding the chunk tree to
lib/fs.cc: define the missing symbols instead of merely trying to
avoid them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Rearrange the logic in `rlower_bound` so it can cope with a tree
that contains mostly block-aligned objects, with a few exceptions
filtered out by `hdr_stop`.
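Roughly, the idea looks like this (a sketch only, using a plain ordered
map in place of the real tree fetcher; names and types are assumed, not
the actual crucible code):

    #include <cstdint>
    #include <functional>
    #include <map>

    // Find the last object at or before `target`, skipping the few
    // exceptional entries that the stop filter (hdr_stop's role) rejects.
    template <class V>
    const V *rlower_bound(const std::map<uint64_t, V> &tree, uint64_t target,
                          const std::function<bool(const V &)> &stop)
    {
            auto it = tree.upper_bound(target);     // first key strictly after target
            while (it != tree.begin()) {
                    --it;                           // step back through keys <= target
                    if (!stop(it->second)) {
                            return &it->second;
                    }
            }
            return nullptr;                         // nothing acceptable at or below target
    }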
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
In some cases, functions already had debug stream support that can be
redirected to the new interface. In other cases, new debug messages
are added.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This allows plugging in an ostream at run time so that we can audit all
the search calls we are doing.
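A minimal sketch of the mechanism, with assumed names (the real
interface in lib may differ):

    #include <atomic>
    #include <iostream>

    // Run-time pluggable debug stream: callers install an ostream, and
    // the search code logs through it only while one is installed.
    static std::atomic<std::ostream *> s_debug_stream{nullptr};

    void set_debug_stream(std::ostream *os) { s_debug_stream = os; }

    void log_search(const char *what)
    {
            if (auto *os = s_debug_stream.load()) {
                    *os << "search: " << what << std::endl;
            }
    }

    // set_debug_stream(&std::cerr) turns the audit on;
    // set_debug_stream(nullptr) turns it off again.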
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
BtrfsFsTreeFetcher was used for early versions of the extent scanner, but
neither subvol nor extent scan now needs an object that is both persistent
and configured to access only one subvol. BtrfsExtentDataFetcher does
the same thing in that case.
Clarify the comments on what the remaining classes do, so that
BtrfsFsTreeFetcher doesn't get inadvertently reinvented in the future.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Binary searches can be extremely slow if the target bytenr is near a
metadata block group, because metadata items are not visible to the
binary search algorithm. In a non-mixed-bg filesystem, there can be
hundreds of thousands of metadata items between data extent items, and
since the binary search algorithm can't see them, it will run searches
that iterate over hundreds of thousands of objects about a dozen times.
This is less of a problem for mixed-bg filesystems because the data and
metadata blocks are not isolated from each other. The binary search
algorithm still can't see the metadata items, but there are usually
some data items close by to prevent the linear item filter from running
too long.
Introduce a new fetcher class (all the good names were taken) that tracks
where the end of the current block group is. When the end of the current
block group is reached in the linear search, skip ahead to a block group
that can contain data items.
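A rough sketch of the skip-ahead logic, assuming a map of block group
start -> (length, flags) built from the chunk tree (the real fetcher
class and its names differ):

    #include <cstdint>
    #include <iterator>
    #include <map>

    struct BlockGroup { uint64_t length; bool has_data; };
    using BlockGroupMap = std::map<uint64_t, BlockGroup>;

    // If bytenr falls inside a block group that cannot contain data
    // items, return the start of the next block group that can;
    // otherwise return bytenr unchanged.
    uint64_t skip_to_data(const BlockGroupMap &bgs, uint64_t bytenr)
    {
            auto it = bgs.upper_bound(bytenr);
            if (it != bgs.begin()) {
                    const auto prev = std::prev(it);
                    if (bytenr < prev->first + prev->second.length && prev->second.has_data) {
                            return bytenr;  // already in a data block group
                    }
            }
            while (it != bgs.end() && !it->second.has_data) {
                    ++it;                   // the linear item filter never sees these
            }
            return it == bgs.end() ? bytenr : it->first;
    }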
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Enable use of the ioctl to probe whether two fds refer to the same btrfs,
without throwing an exception.
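The commit message doesn't name the ioctl, but one plausible shape for
the probe is comparing fsids from BTRFS_IOC_FS_INFO and reporting
failure instead of throwing (a sketch; function name assumed):

    #include <linux/btrfs.h>
    #include <sys/ioctl.h>
    #include <cstring>

    // True only if both fds are on btrfs and their fsids match; any
    // ioctl failure (e.g. not a btrfs at all) simply returns false.
    bool same_btrfs(int fd_a, int fd_b)
    {
            btrfs_ioctl_fs_info_args a = {}, b = {};
            if (ioctl(fd_a, BTRFS_IOC_FS_INFO, &a)) return false;
            if (ioctl(fd_b, BTRFS_IOC_FS_INFO, &b)) return false;
            return memcmp(a.fsid, b.fsid, sizeof(a.fsid)) == 0;
    }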
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tasks are not allowed to be queued more than once, but it is allowed
to queue a Task while it's already running, which means a Task can be
executed on two threads in parallel. Tasks detect this and handle it
by queueing the Task on its own post-exec queue. That in turn leads
to Workers which continually execute the same Task if that Task doesn't
create any new Tasks, while other Tasks sit on the Master queue waiting
for a Worker to dequeue them.
For idle Tasks, we don't want the Task to be rescheduled immediately.
We want the idle Task to execute again after every available Task on
both the main and idle queues has been executed.
Fix these by having each Task reschedule itself on the appropriate
queue when it finishes executing.
Priority-queued Tasks should be executed in priority order across not
just one Task's post-exec queue, but the entire local queue of the
TaskConsumer.
Fix this by moving the sort into either the TaskConsumer that receives
a post-exec queue, if there is one, or into the Task that is created
to insert the post-exec queue into a TaskConsumer when one becomes
available in the future.
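A toy model of the rescheduling fix, with all names assumed (the real
Task library tracks this with internal state and locks):

    #include <deque>
    #include <functional>
    #include <memory>

    struct Task {
            std::function<void()> fn;
            bool idle = false;      // idle Tasks run only when nothing else is queued
            bool requeue = false;   // set when the Task was queued while running
    };

    std::deque<std::shared_ptr<Task>> main_queue, idle_queue;

    void run_one(const std::shared_ptr<Task> &t)
    {
            t->fn();
            if (t->requeue) {
                    t->requeue = false;
                    // Back of the appropriate queue, not the same Worker again.
                    (t->idle ? idle_queue : main_queue).push_back(t);
            }
    }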
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tasks using non-priority FIFO dependency tracking can insert themselves
into their own queue, to run the Task again immediately after it exits.
For priority queues, this attempts to splice the post-exec queue into
itself, which doesn't seem like a good idea.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Suppose Tasks A, B, and C are created in that order, and are currently
running. Task T acquires Exclusion E. Tasks B, A, and C attempt to
acquire the same Exclusion, in that order, but fail because Task T
holds it.
The result is Task T with a post-exec queue:
    T, [ B, A, C ] sort_requested
Now suppose Task U acquires Exclusion F, then Task T attempts to acquire
Exclusion F. Task T fails to acquire F, so T is inserted into U's
post-exec queue. The result at the end of the execution of T is a tree:
    U, [ T ] sort_requested
        \-> [ B, A, C ] sort_requested
Task T exits after failing to acquire a lock. When T exits, T will
sort its post-exec queue and submit the post-exec queue for execution
immediately:
    Worker 1: U, [ T ] sort_requested
    Worker 2: A, B, C
This isn't ideal because T, A, B, and C all depend on at least one
common Exclusion, so they are likely to immediately conflict with T
when U exits and T runs again.
Ideally, A, B, and C would at least remain in a common queue with T,
and ideally that queue is sorted.
Instead of inserting T into U's post-exec queue, insert T and all
of T's post-exec queue, which creates a single flattened Task list:
    U, [ T, B, A, C ] sort_requested
Then when U exits, it will sort [ T, B, A, C ] into [ A, B, C, T ],
and run all of the queued Tasks in age priority order:
    U exited, [ T, B, A, C ] sort_requested
    U exited, [ A, B, C, T ]
    [ A, B, C, T ] on TaskConsumer queue
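With `std::list` as the queue type, the flattening step is a
constant-time splice (a sketch, with names assumed):

    #include <list>
    #include <memory>

    struct Task;
    using TaskQueue = std::list<std::shared_ptr<Task>>;
    struct Task {
            TaskQueue post_exec_queue;
    };

    // Instead of inserting only T, move T's entire post-exec queue with
    // it, leaving a single flat, sortable list behind U.
    void append_task_with_queue(TaskQueue &holder_queue, const std::shared_ptr<Task> &t)
    {
            holder_queue.push_back(t);                                      // U, [ ..., T ]
            holder_queue.splice(holder_queue.end(), t->post_exec_queue);    // U, [ ..., T, B, A, C ]
            // splice is O(1) and leaves t->post_exec_queue empty.
    }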
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Task started out as a self-organizing parallel-make algorithm, but ended
up becoming a half-broken wait-die algorithm. When a contended object
is already locked, Tasks enter a FIFO queue to restart and acquire the
lock. This is the "die" part of wait-die (all locks on an Exclusion are
non-blocking, so no Task ever does "wait"). The lock queue is FIFO wrt
_lock acquisition order_, not _Task age_ as required by the wait-die
algorithm.
Make it a 25%-broken wait-die algorithm by sorting the Tasks on lock
queues in order of Task ID, i.e. oldest-first, or FIFO wrt Task age.
This ensures the oldest Task waiting for an object is the one to get
it when it becomes available, as expected from the wait-die algorithm.
This should reduce the amount of time Tasks spend on the execution queue,
and reduce memory usage by avoiding the accumulation of Tasks that cannot
make forward progress.
Note that turning `TaskQueue` into an ordered container would have
undesirable side-effects:
* `std::list` has some useful properties wrt stability of object
location and cost of splicing. Other containers may not have these,
and `std::list` does have a `sort` method.
* Some Task objects are created at the beginning and reused continually,
but we really do want those Tasks to be executed in FIFO order wrt
submission, not Task ID. We can exclude these Tasks by only doing the
sorting when a Task is queued for an Exclusion object (see the sketch
below).
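A sketch of the sort itself, using `std::list::sort` so object
locations stay stable (Task fields assumed):

    #include <cstdint>
    #include <list>
    #include <memory>

    struct Task {
            uint64_t id;    // monotonically increasing: lower ID means older Task
    };
    using TaskQueue = std::list<std::shared_ptr<Task>>;

    // Called only when a Task is queued on an Exclusion, so reused FIFO
    // Tasks keep their submission order.
    void sort_oldest_first(TaskQueue &q)
    {
            q.sort([](const std::shared_ptr<Task> &a, const std::shared_ptr<Task> &b) {
                    return a->id < b->id;
            });
    }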
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Since we're now using weak symbols for dodgy libc functions, we might
as well do it for gettid() too.
Declare ::gettid() in the global namespace as a weak symbol and let
libc override it.
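A sketch of the weak-symbol pattern (wrapper name assumed):

    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    // Weak declaration: if libc defines gettid(), the linker binds to
    // it; otherwise the symbol is null and we use the raw syscall.
    extern "C" pid_t gettid() __attribute__((weak));

    pid_t crucible_gettid()
    {
            if (gettid) {
                    return gettid();
            }
            return static_cast<pid_t>(syscall(SYS_gettid));
    }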
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
openat2 allows closing more TOCTOU holes, but we can only use it when
the kernel supports it.
This should disappear seamlessly when libc implements the function.
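A sketch of the same weak-symbol approach applied to openat2 (wrapper
name assumed; requires kernel headers that provide linux/openat2.h and
SYS_openat2):

    #include <linux/openat2.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // If libc ever defines openat2(), the weak symbol resolves to it
    // and this wrapper steps aside; until then, use the raw syscall.
    // Kernels without openat2 return ENOSYS and callers fall back to
    // plain openat().
    extern "C" int openat2(int dirfd, const char *pathname,
                           struct open_how *how, size_t size) __attribute__((weak));

    int crucible_openat2(int dirfd, const char *pathname,
                         struct open_how *how, size_t size)
    {
            if (openat2) {
                    return openat2(dirfd, pathname, how, size);
            }
            return static_cast<int>(syscall(SYS_openat2, dirfd, pathname, how, size));
    }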
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
* Allow RateLimiter to change rate after construction.
* Check range of rate argument in constructor.
* Atomic increment for RateEstimator.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
bees might be unpaused at any time, so make sure that the dynamic load
calculation is ready with a non-zero thread count.
This avoids a delay of up to 5 seconds when responding to SIGUSR2
when loadavg tracking is enabled.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
When paused, TaskConsumer threads will eventually notice the paused
condition and exit; however, there's nothing to restart threads when
exiting the paused state.
When unpausing, and while the lock is already held, create TaskConsumer
threads as needed to reach the target thread count.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Commit 72c3bf8438830b65cae7bdaff126053e562280e5 ("fs: handle ENOENT
within lib") was meant to prevent exceptions when a subvol is deleted.
If the search ioctl fails, the kernel won't set nr_items in the
ioctl output, which means `nr_items` still has the input value. When
ENOENT is detected, `this->nr_items` is set to 0, then later `*this =
ioctl_ptr->key` overwrites `this->nr_items` with the original requested
number of items.
This replaced the ENOENT exception with an exception triggered by
interpreting garbage in the memory buffer. The number of exceptions
was reduced because the memory buffers are frequently reused, but upper
layers would then reject the data or ignore it because it didn't match
the key range.
Fix by setting `ioctl_ptr->key.nr_items` to zero, which then overwrites
`this->nr_items`, so the loop that extracts items from the ioctl data
gets the right number of items (i.e. zero).
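A sketch of the fixed sequence (field names from the btrfs ioctl ABI;
the surrounding class is assumed):

    #include <linux/btrfs.h>
    #include <sys/ioctl.h>
    #include <cerrno>

    // Run the search; on ENOENT, zero the item count in the ioctl
    // buffer itself, so that the later copy of ioctl_ptr->key
    // propagates nr_items == 0 instead of the requested count.
    bool do_search(int fd, btrfs_ioctl_search_args_v2 *ioctl_ptr)
    {
            if (ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_ptr)) {
                    if (errno != ENOENT) {
                            return false;           // real error, caller throws
                    }
                    ioctl_ptr->key.nr_items = 0;    // "we got zero items", not garbage
            }
            return true;    // caller does *this = ioctl_ptr->key as before
    }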
Fixes: 72c3bf8438830b65cae7bdaff126053e562280e5 ("fs: handle ENOENT within lib")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
operator<< was a friend function that locked the ByteVector, then
invoked hexdump on the ByteVector, which used ByteVector::operator[]...
which locked the ByteVector again, resulting in a deadlock.
operator<< shouldn't be a friend anyway. Make hexdump use the normal
public access methods for ByteVector.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a second level queue which is only serviced when the local and global
queues are empty.
At some point there might be a need to implement a full priority queue,
but for now two classes are sufficient.
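A toy model of the dequeue order (queue names assumed):

    #include <deque>
    #include <functional>
    #include <optional>

    std::deque<std::function<void()>> local_queue, global_queue, idle_queue;

    // Idle Tasks are serviced only when both other queues are empty.
    std::optional<std::function<void()>> next_task()
    {
            for (auto *q : { &local_queue, &global_queue, &idle_queue }) {
                    if (!q->empty()) {
                            auto fn = std::move(q->front());
                            q->pop_front();
                            return fn;
                    }
            }
            return std::nullopt;    // nothing to do at any priority
    }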
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This should help clean up some of the uglier status outputs.
Supports:
* multi-line table cells
* character fills
* sparse tables
* insert, delete by row and column
* vertical separators
and not much else.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Add a master switch to turn off the entire MultiLock infrastructure for
testing, without having to remove and add all the individual entry points.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This prevents the storms of exceptions that occur when a subvol is
deleted. We simply treat the entire tree as if it were empty.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
The kernel has not imposed a 16 MiB limit on dedupe requests since
v4.18-rc1 b67287682688 ("Btrfs: dedupe_file_range ioctl: remove 16MiB
restriction").
Kernels before v4.18 would truncate the request and return the size
actually deduped in `bytes_deduped`. Kernel v4.18 and later will loop
in the kernel until the entire request is satisfied (although still
in 16 MiB chunks, so larger extents will be split).
Modify the loop in userspace to measure the size the kernel actually
deduped, instead of assuming the kernel will only accept 16 MiB.
On current kernels this will always loop exactly once.
Since we now rely on `bytes_deduped`, make sure it has a sane value.
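A sketch of the loop (helper name assumed; uses the FIDEDUPERANGE
structures from linux/fs.h):

    #include <linux/fs.h>
    #include <sys/ioctl.h>
    #include <cstdint>

    bool dedupe_range(int src_fd, uint64_t src_off, uint64_t length,
                      int dst_fd, uint64_t dst_off)
    {
            while (length > 0) {
                    alignas(file_dedupe_range) char buf[sizeof(file_dedupe_range)
                            + sizeof(file_dedupe_range_info)] = {};
                    auto *args = reinterpret_cast<file_dedupe_range *>(buf);
                    args->src_offset = src_off;
                    args->src_length = length;
                    args->dest_count = 1;
                    args->info[0].dest_fd = dst_fd;
                    args->info[0].dest_offset = dst_off;
                    args->info[0].bytes_deduped = 0;        // make sure it's sane
                    if (ioctl(src_fd, FIDEDUPERANGE, args)) return false;
                    if (args->info[0].status != FILE_DEDUPE_RANGE_SAME) return false;
                    const uint64_t done = args->info[0].bytes_deduped;
                    if (done == 0) return false;            // no forward progress
                    src_off += done;
                    dst_off += done;
                    length -= done;
            }
            return true;    // on v4.18+ kernels this loop runs exactly once
    }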
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Apparently reinterpret_cast<uint64_t> sign-extends 32-bit pointers.
This is OK when running on a 32-bit kernel that will truncate the pointer
to 32 bits, but when running on a 64-bit kernel, the extra bits are
interpreted as part of the (now very invalid) address.
Use <uintptr_t> instead, which is an unsigned integer type of the same
width as the arch's pointer type. Ordinary numeric conversion can take
it from there, filling the rest of the word with zeros.
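An illustration of the difference:

    #include <cstdint>

    // On a 32-bit arch, reinterpret_cast<uint64_t>(p) may sign-extend;
    // going through uintptr_t zero-fills the upper half instead.
    uint64_t to_ioctl_u64(void *p)
    {
            return reinterpret_cast<uintptr_t>(p);  // ordinary integer conversion zero-extends
    }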
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Some malloc implementations will try to mmap() and munmap() large buffers
every time they are used, causing a severe loss of performance.
Nothing ever overrode the virtual methods, and there was no virtual
destructor, so they caused compiler warnings at build time when used
with a template that tries to delete pointers to them.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
crucible::VERSION doesn't make much sense now that libcrucible no
longer exists as a shared library. Nothing ever referenced it, so
it can go away.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
According to ioctl_iflags(2):

    The type of the argument given to the FS_IOC_GETFLAGS and
    FS_IOC_SETFLAGS operations is int *, notwithstanding the
    implication in the kernel source file include/uapi/linux/fs.h
    that the argument is long *.

So this code doesn't work on be64 machines.
Also, Valgrind complains about it.
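The corrected pattern, per the man page:

    #include <linux/fs.h>
    #include <sys/ioctl.h>

    // The third ioctl argument really is int *; a long would read or
    // write the wrong 4 bytes on big-endian 64-bit machines.
    int get_iflags(int fd, int *flags)
    {
            *flags = 0;
            return ioctl(fd, FS_IOC_GETFLAGS, flags);
    }

    int set_iflags(int fd, int flags)
    {
            return ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }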
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
A subtle distinction, and not one that is particularly relevant to bees,
but it does make toolchains complain.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Another instance of the pattern where we derived a crucible class
from a btrfs struct. Make it an automatic variable instead.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We only use BtrfsExtentInfo when it's exactly equivalent to the
base, so drop the derived class.
While we're here, fix BtrfsExtentSame::add so it uses a btrfs-compatible
uint64_t instead of an off_t.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
It turns out I've been using pthread_setname_np wrong the whole time:
* on Linux, the thread name length is 15 characters.
TASK_COMM_LEN is 16 bytes, and the last one is always 0.
This is now hardcoded in many places and cannot be changed.
* pthread_setname_np doesn't return -errno, so DIE_IF_MINUS_ERRNO
was the wrong macro. On the other hand, we never want to do anything
differently when pthread_setname_np fails, so we never needed to
check the return value.
Also, libc silently ignores attempts to set the thread name when it is too
long. That's almost certainly a libc bug, but libc probably suppresses
the error result for the same reasons I ignore the error result.
Wrap the pthread_setname function with a C++ std::string overload that
truncates the argument at 15 characters, so we at least get the first
part of the task name in the thread name field. Later commits can deal
with making the bees thread names shorter.
Also wrap pthread_getname for symmetry.
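A sketch of the wrappers (names assumed):

    #include <pthread.h>
    #include <string>

    // Truncate to the 15-character Linux limit and deliberately ignore
    // the return value: there is nothing useful to do when setting a
    // thread name fails.
    void pthread_setname(const std::string &name)
    {
            (void)pthread_setname_np(pthread_self(), name.substr(0, 15).c_str());
    }

    std::string pthread_getname()
    {
            char buf[16] = { 0 };   // TASK_COMM_LEN, including the trailing NUL
            if (pthread_getname_np(pthread_self(), buf, sizeof(buf))) {
                    return std::string();
            }
            return buf;
    }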
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
btrfs-tree provides classes for low-level access to btrfs tree objects.
An item class is provided to decode polymorphic btrfs item fields.
Several tree classes provide forward and backward iteration over raw
object items at different tree levels.
A csum tree class provides convenient access to csums by bytenr,
supporting all current btrfs csum types.
Wrapper classes for inode and subvol items provide direct access to
btrfs metadata fields without clumsy stat() wrappers or ioctls.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
We are using ByteVectors from multiple threads in some cases. Mostly
these are the status and progress threads which read the ByteVector
object references embedded in BEESNOTE macros.
Since it's not clear what the data race implications are, protect
the shared_ptr in ByteVector with a mutex for now.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Tasks are often running longer than 5 seconds (especially extents with
multiple references requiring copy operations), so the load tracking
algorithm needs to average several samples over a longer period of time
than 5 seconds. If the sample period is 60 seconds, we end up recomputing
the original load average from current_load, so skip the rounding error
and use the original load average value.
Arguably the real fix is to break up the more complex extent operations
over several downstream Task objects, but that's a more significant
design change.
Tweak the attack and decay rates so that threads are started a little
more slowly, but still stopped rapidly when load spikes up.
Remove the hysteresis to provide support for load average targets
below 1, or with fractional components, with a PWM-like effect.
Signed-off-by: Zygo Blaxell <bees@furryterror.org>