mirror of
https://github.com/Zygo/bees.git
synced 2025-06-17 01:56:16 +02:00
context: better detection for toxic extents
We detect toxic extents by measuring how long the LOGICAL_INO ioctl takes to run. If it is above some threshold, we consider the extent toxic, and blacklist it; otherwise, we process the extent normally.

The detector was using the execution time of the ioctl, which detects toxic extents, but it also detects pauses of the bees process and transaction commit latency due to load. This leads to a significant number of false positives. The detection threshold was also very long, burning a lot of kernel CPU before the detection was triggered.

Use the per-thread system CPU statistics to measure the kernel CPU usage of the LOGICAL_INO call directly. This is much more reliable because it is not confounded by other threads, and it's faster because we can set the time threshold two orders of magnitude lower.

Also remove the lock and mutex added in "context: serialize LOGICAL_INO calls" because we theoretically no longer need it (but leave the code there with #if 0 in case we do need it in practice).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
@@ -88,11 +88,8 @@ const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
 // Log warnings when an operation takes too long
 const double BEES_TOO_LONG = 5.0;
 
-// Avoid any extent where LOGICAL_INO takes this long
-const double BEES_TOXIC_DURATION = 9.9;
-// EXPERIMENT: Kernel v4.14+ may let us ignore toxicity
-// NOPE: kernel 4.14 has the same toxicity problems as any previous kernel
-// const double BEES_TOXIC_DURATION = 99.9;
+// Avoid any extent where LOGICAL_INO takes this much kernel CPU time
+const double BEES_TOXIC_SYS_DURATION = 0.1;
 
 // How long between hash table histograms
 const double BEES_HASH_TABLE_ANALYZE_INTERVAL = BEES_STATS_INTERVAL;
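The commit message describes measuring the kernel CPU time of a single call on a per-thread basis. The hunk shown here changes only the threshold constant, so the following is a minimal sketch of that idea and not the code from this commit. It assumes Linux, where getrusage() accepts RUSAGE_THREAD. The helper names thread_sys_cpu_seconds() and is_toxic() are hypothetical, and the callable stands in for the LOGICAL_INO ioctl.

```cpp
// Sketch only: illustrates per-thread system CPU measurement, not the bees implementation.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // RUSAGE_THREAD is a Linux/GNU extension
#endif
#include <sys/resource.h>      // getrusage, struct rusage, RUSAGE_THREAD
#include <sys/time.h>          // struct timeval
#include <functional>
#include <stdexcept>

// Threshold value taken from the diff above (seconds of kernel CPU time).
const double BEES_TOXIC_SYS_DURATION = 0.1;

// Hypothetical helper: kernel (system) CPU seconds used so far by the calling thread.
static double thread_sys_cpu_seconds()
{
	struct rusage ru;
	if (getrusage(RUSAGE_THREAD, &ru) != 0) {
		throw std::runtime_error("getrusage(RUSAGE_THREAD) failed");
	}
	return static_cast<double>(ru.ru_stime.tv_sec) + ru.ru_stime.tv_usec / 1e6;
}

// Hypothetical helper: run `call` on this thread and report whether it consumed
// more kernel CPU time than `threshold`. Time spent sleeping, or time used by
// other threads, does not count toward the measurement.
static bool is_toxic(const std::function<void()> &call,
                     double threshold = BEES_TOXIC_SYS_DURATION)
{
	const double before = thread_sys_cpu_seconds();
	call();
	const double after = thread_sys_cpu_seconds();
	return (after - before) > threshold;
}
```

A caller would wrap the expensive call, for example is_toxic([&]{ /* issue the ioctl here */ }), and skip the extent when the result is true. Because only this thread's system time is counted, a wall-clock stall such as a scheduler pause or a wait on a transaction commit should not trip the threshold on its own.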