From 63ddbb9a4f14c7302e4a5047d1be7d07e6508373 Mon Sep 17 00:00:00 2001
From: Zygo Blaxell
Date: Sun, 21 Oct 2018 21:57:37 -0400
Subject: [PATCH] context: serialize LOGICAL_INO calls

LOGICAL_INO can trip over the btrfs slow-backrefs bug, resulting in some
very long in-kernel runtimes. If too many threads are executing
LOGICAL_INO then there may be no cores left on the system to run other
tasks.

Toxic extent detection is done by a very rudimentary algorithm which can
be confused by unrelated sources of latency within btrfs (especially
commit latency). The algorithm can also be confused by other threads
executing the LOGICAL_INO ioctl.

These are two good reasons to prevent any two threads in a single bees
process instance from executing LOGICAL_INO at the same time, so let's
do that.

It is possible to limit the number of threads executing LOGICAL_INO with
the -c and -C options; however, this also limits the number of threads
which can perform any operation, while only LOGICAL_INO (*) has such a
profound effect on the rest of system operation.

Also make the status message clearer about exactly when LOGICAL_INO is
executed, as opposed to merely waiting to acquire a lock before executing
the ioctl.

(*) or maybe FILE_EXTENT_SAME. The problem function that keeps showing
up in kernel stack traces is find_parent_nodes, which is called by both
the LOGICAL_INO and FILE_EXTENT_SAME ioctls. We'll try this change first
and see if it prevents any recurrences of forced watchdog reboots; if it
does not, then we'll limit FILE_EXTENT_SAME the same way.

Signed-off-by: Zygo Blaxell
---
 src/bees-context.cc | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/src/bees-context.cc b/src/bees-context.cc
index 4b1f54f..7f99157 100644
--- a/src/bees-context.cc
+++ b/src/bees-context.cc
@@ -761,6 +761,15 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 {
 	THROW_CHECK1(invalid_argument, addr, !addr.is_magic());
 	THROW_CHECK0(invalid_argument, !!root_fd());
+
+	// There can be only one of these running at a time, or the slow
+	// backrefs bug will kill the whole system. Also it looks like there
+	// are so many locks held while LOGICAL_INO runs that there is no
+	// point in trying to run two of them on the same filesystem.
+	BEESNOTE("waiting to resolve addr " << addr);
+	static mutex s_resolve_mutex;
+	unique_lock<mutex> lock(s_resolve_mutex);
+
 	Timer resolve_timer;
 
 	// There is no performance benefit if we restrict the buffer size.
@@ -768,6 +777,7 @@ BeesContext::resolve_addr_uncached(BeesAddress addr)
 
 	{
 		BEESTOOLONG("Resolving addr " << addr << " in " << root_path() << " refs " << log_ino.m_iors.size());
+		BEESNOTE("resolving addr " << addr << " with LOGICAL_INO");
 		if (log_ino.do_ioctl_nothrow(root_fd())) {
 			BEESCOUNT(resolve_ok);
 		} else {
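
Reviewer note: the locking pattern in the first hunk (a function-local
static mutex held through a unique_lock for the rest of the function) can
be tried in isolation. The sketch below is only an illustration under
stated assumptions: FakeSlowOp, resolve_serialized and peak_concurrency
are hypothetical names that do not exist in bees, and a short sleep stands
in for the LOGICAL_INO ioctl. It records the highest number of callers
that were ever inside the slow operation at once.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an expensive call such as LOGICAL_INO.
// It tracks how many callers are inside run() at the same time.
struct FakeSlowOp {
	std::atomic<int> in_flight{0};
	std::atomic<int> peak{0};

	void run()
	{
		const int now = ++in_flight;
		// Raise the recorded peak if this caller pushed it higher.
		int prev = peak.load();
		while (now > prev && !peak.compare_exchange_weak(prev, now)) {
		}
		std::this_thread::sleep_for(std::chrono::milliseconds(1));
		--in_flight;
	}
};

// Same shape as the patch: one function-local static mutex, so every
// caller in the process queues here no matter which object it uses.
void resolve_serialized(FakeSlowOp &op)
{
	static std::mutex s_resolve_mutex;
	std::unique_lock<std::mutex> lock(s_resolve_mutex);
	op.run();
}

// Run `threads` workers, each calling resolve_serialized
// `calls_per_thread` times, and return the peak concurrency observed.
int peak_concurrency(int threads, int calls_per_thread)
{
	assert(threads > 0 && calls_per_thread > 0);
	FakeSlowOp op;
	std::vector<std::thread> workers;
	for (int t = 0; t < threads; ++t) {
		workers.emplace_back([&op, calls_per_thread] {
			for (int i = 0; i < calls_per_thread; ++i) {
				resolve_serialized(op);
			}
		});
	}
	for (auto &w : workers) {
		w.join();
	}
	return op.peak.load();
}
```

Because the mutex is a function-local static, there is one per process
rather than one per object, which matches the commit message's goal of
serializing within a single bees process instance. It does not coordinate
separate processes working on the same filesystem.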