mirror of https://github.com/Zygo/bees.git synced 2025-06-17 01:56:16 +02:00

context: better detection for toxic extents

We detect toxic extents by measuring how long the LOGICAL_INO ioctl takes
to run.  If it is above some threshold, we consider the extent toxic,
and blacklist it; otherwise, we process the extent normally.

The detector was using the wall-clock execution time of the ioctl, which
detects toxic extents, but it also detects pauses of the bees process and
transaction commit latency due to load.  This leads to a significant
number of false positives.  The detection threshold was also very long
(9.9 seconds), burning a lot of kernel CPU before the detection was
triggered.

Use the per-thread system CPU statistics to measure the kernel CPU usage
of the LOGICAL_INO call directly.  This is much more reliable because it
is not confounded by other threads, and it's faster because we can set
the time threshold two orders of magnitude lower.

Also remove the lock and mutex added in "context: serialize LOGICAL_INO
calls" because we theoretically no longer need them (but leave the code
there under #if 0 in case we do need them in practice).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
Zygo Blaxell
2018-10-31 21:12:16 -04:00
parent 9a97699dd9
commit 542371684c
2 changed files with 33 additions and 9 deletions

@@ -88,11 +88,8 @@ const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
 // Log warnings when an operation takes too long
 const double BEES_TOO_LONG = 5.0;
 
-// Avoid any extent where LOGICAL_INO takes this long
-const double BEES_TOXIC_DURATION = 9.9;
-// EXPERIMENT: Kernel v4.14+ may let us ignore toxicity
-// NOPE: kernel 4.14 has the same toxicity problems as any previous kernel
-// const double BEES_TOXIC_DURATION = 99.9;
+// Avoid any extent where LOGICAL_INO takes this much kernel CPU time
+const double BEES_TOXIC_SYS_DURATION = 0.1;
 
 // How long between hash table histograms
 const double BEES_HASH_TABLE_ANALYZE_INTERVAL = BEES_STATS_INTERVAL;