mirror of https://github.com/Zygo/bees.git synced 2025-05-17 21:35:45 +02:00

context: serialize LOGICAL_INO calls

LOGICAL_INO can trip over the btrfs slow-backrefs bug, resulting in
some very long in-kernel runtimes.  If too many threads are executing
LOGICAL_INO then there may be no cores left on the system to run other
tasks.

Toxic extent detection is done by a very rudimentary algorithm which
can be confused by unrelated sources of latency within btrfs (especially
commit latency).  The algorithm can also be confused by other threads
executing the LOGICAL_INO ioctl.

These are two good reasons to prevent any two threads in a single bees
process instance from executing LOGICAL_INO at the same time, so let's
do that.
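
As a rough illustration of the serialization described above, here is a minimal
sketch, not the bees implementation. It assumes a stand-in function
(slow_call_stub) in place of the real LOGICAL_INO ioctl. The names
slow_call_stub, resolve_serialized, run_workers, and the two counters are
hypothetical. Only the function-local static mutex pattern mirrors the change
in this commit.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counters used only to observe concurrency in this sketch.
static std::atomic<int> g_in_flight{0};
static std::atomic<int> g_max_in_flight{0};

// Hypothetical stand-in for the slow ioctl; it records how many callers
// are inside it at once and then sleeps briefly.
static void slow_call_stub() {
    int now = ++g_in_flight;
    int prev = g_max_in_flight.load();
    // Raise the recorded maximum if this caller observed a higher value.
    while (now > prev && !g_max_in_flight.compare_exchange_weak(prev, now)) {
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
    --g_in_flight;
}

// One process-wide mutex, shared by every caller of this function, so at
// most one thread is ever inside slow_call_stub().
void resolve_serialized() {
    static std::mutex s_resolve_mutex;
    std::unique_lock<std::mutex> lock(s_resolve_mutex);
    slow_call_stub();
}

// Start n worker threads and return the highest concurrency observed.
int run_workers(int n) {
    g_max_in_flight = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back(resolve_serialized);
    }
    for (auto &t : workers) {
        t.join();
    }
    return g_max_in_flight.load();
}
```

Calling run_workers(8) starts eight threads, yet the observed maximum number
of threads inside the stub stays at one. Unlike lowering the worker count, the
other work those threads do outside the lock is not restricted.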

It is possible to limit the number of threads executing LOGICAL_INO with
the -c and -C options; however, this also limits the number of threads
which can perform any operation, while only LOGICAL_INO (*) has such a
profound effect on the rest of system operation.

Also make the status message clearer about exactly when LOGICAL_INO is
executed, as opposed to merely waiting to acquire a lock before executing
the ioctl.
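
The two-stage status message can be sketched as follows. This is an
illustration under assumptions, not the bees code: bees sets the status with
its BEESNOTE macro, which is not modelled here. The function
resolve_with_status and the returned history vector are hypothetical and exist
only to show the ordering of the two messages around the lock.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical helper: returns the status strings in the order a thread
// would publish them while resolving an address.
std::vector<std::string> resolve_with_status(unsigned long addr) {
    static std::mutex s_mutex;
    std::vector<std::string> history;

    // Status before the lock: the thread may block here for a long time.
    history.push_back("waiting to resolve addr " + std::to_string(addr));
    std::unique_lock<std::mutex> lock(s_mutex);

    // Status after the lock: the ioctl is what runs from this point on.
    history.push_back("resolving addr " + std::to_string(addr) +
                      " with LOGICAL_INO");

    // The real ioctl call would go here (omitted in this sketch).
    return history;
}
```

Publishing a distinct message on each side of the lock lets a reader of the
status output tell a thread that is queued behind another caller from the one
that is actually inside the ioctl.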

(*) or maybe FILE_EXTENT_SAME.  The problem function that keeps showing
up in kernel stack traces is find_parent_nodes, which is called by both
the LOGICAL_INO and FILE_EXTENT_SAME ioctls.  We'll try this change
first and see if it prevents any recurrences of forced watchdog reboots;
if it does not, then we'll limit FILE_EXTENT_SAME the same way.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Zygo Blaxell 2018-10-21 21:57:37 -04:00
parent 373b9ef038
commit 63ddbb9a4f

@@ -761,6 +761,15 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
{
THROW_CHECK1(invalid_argument, addr, !addr.is_magic());
THROW_CHECK0(invalid_argument, !!root_fd());
// There can be only one of these running at a time, or the slow
// backrefs bug will kill the whole system. Also it looks like there
// are so many locks held while LOGICAL_INO runs that there is no
// point in trying to run two of them on the same filesystem.
BEESNOTE("waiting to resolve addr " << addr);
static mutex s_resolve_mutex;
unique_lock<mutex> lock(s_resolve_mutex);
Timer resolve_timer;
// There is no performance benefit if we restrict the buffer size.
@@ -768,6 +777,7 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
{
BEESTOOLONG("Resolving addr " << addr << " in " << root_path() << " refs " << log_ino.m_iors.size());
BEESNOTE("resolving addr " << addr << " with LOGICAL_INO");
if (log_ino.do_ioctl_nothrow(root_fd())) {
BEESCOUNT(resolve_ok);
} else {