From 14b35e34265b10f25e572b9ad7ef273b6f4fb58d Mon Sep 17 00:00:00 2001
From: Zygo Blaxell <bees@furryterror.org>
Date: Thu, 4 Oct 2018 20:54:08 -0400
Subject: [PATCH] docs: add "what to do when something goes wrong" page

Standard crash backtrace collection, plus $BEESSTATUS for the high-level
overview of what bees is doing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
---
 README.md     |   1 +
 docs/index.md |   1 +
 docs/wrong.md | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 162 insertions(+)
 create mode 100644 docs/wrong.md
diff --git a/README.md b/README.md
index 7243b94..d3b6d08 100644
--- a/README.md
+++ b/README.md
@@ -49,6 +49,7 @@ Recommended Reading
  * [bees Gotchas](docs/gotchas.md)
  * [btrfs kernel bugs](docs/btrfs-kernel.md)
  * [bees vs. other btrfs features](docs/btrfs-other.md)
+ * [What to do when something goes wrong](docs/wrong.md)
 
 More Information
 ----------------
diff --git a/docs/index.md b/docs/index.md
index 13c5626..e9fee65 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -49,6 +49,7 @@ Recommended Reading
  * [bees Gotchas](gotchas.md)
  * [btrfs kernel bugs](btrfs-kernel.md)
  * [bees vs. other btrfs features](btrfs-other.md)
+ * [What to do when something goes wrong](wrong.md)
 
 More Information
 ----------------
diff --git a/docs/wrong.md b/docs/wrong.md
new file mode 100644
index 0000000..cf3d4c9
--- /dev/null
+++ b/docs/wrong.md
@@ -0,0 +1,160 @@
+What to do when something goes wrong with bees
+==============================================
+
+Hangs and excessive slowness
+----------------------------
+
+### Are you using qgroups or autodefrag?
+
+  Read about [bad btrfs feature interactions](btrfs-other.md).
+
+### Use load-throttling options
+
+  If bees is just more aggressive than you would like, consider using
+  [load throttling options](options.md).  These are usually more effective
+  than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
+  certainly use those too).
+
+### Check `$BEESSTATUS`
+
+  If bees or the filesystem seems to be stuck, check the contents of
+  `$BEESSTATUS`.  bees describes what it is doing (and how long it has
+  been trying to do it) through this file.
+
+  Sample:
+
+<pre>
+THREADS (work queue 68 tasks):
+	tid 20939: crawl_5986: dedup BeesRangePair: 512K src[0x9933f000..0x993bf000] dst[0x9933f000..0x993bf000]
+src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
+dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
+	tid 20940: crawl_5986: dedup BeesRangePair: 512K src[0x992bf000..0x9933f000] dst[0x992bf000..0x9933f000]
+src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
+dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
+	tid 21177: crawl_5986: dedup BeesRangePair: 512K src[0x9923f000..0x992bf000] dst[0x9923f000..0x992bf000]
+src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
+dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
+	tid 21677: bees: [68493.1s] main
+	tid 21689: crawl_transid: [236.508s] waiting 332.575s for next 10 transid RateEstimator { count = 87179, raw = 969.066 / 32229.2, ratio = 969.066 / 32465.7, rate = 0.0298489, duration(1) = 33.5021, seconds_for(1) = 1 }
+	tid 21690: status: writing status to file '/run/bees.status'
+	tid 21691: crawl_writeback: [203.456s] idle, dirty
+	tid 21692: hash_writeback: [12.466s] flush rate limited after extent #17 of 64 extents
+	tid 21693: hash_prefetch: [2896.61s] idle 3600s
+</pre>
+
+ The time in square brackets indicates how long the thread has been
+ executing the current task (if this time is below 5 seconds then it
+ is omitted).  We can see here that the main thread (and therefore the
+ bees process as a whole) has been running for 68493.1 seconds, the
+ last hash table write was 12.5 seconds ago, and the last transid poll
+ was 236.5 seconds ago.  Three worker threads are currently performing
+ dedupe on extents.
+
+ Thread names of note:
+
+ * `crawl_12345`: scan/dedupe worker threads (the number is the subvol
+   ID which the thread is currently working on).  These threads appear
+   and disappear from the status dynamically according to the requirements
+   of the work queue and loadavg throttling.
+ * `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
+ * `crawl_master`: task that finds new extents in the filesystem and populates the work queue
+ * `crawl_transid`: btrfs transid (generation number) tracker and polling thread
+ * `status`: the thread that writes the status reports to `$BEESSTATUS`
+ * `crawl_writeback`: writes the scanner progress to `beescrawl.dat`
+ * `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
+ * `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly
+
+### Dump kernel stacks of hung processes
+
+Check the kernel stacks of all blocked kernel processes:
+
+	ps xar | while read -r x y; do ps "$x"; head -50 --verbose /proc/"$x"/task/*/stack; done | tee lockup-stacks.txt
+
+Submit the above information in your bug report.
+
+### Check dmesg for btrfs stack dumps
+
+Sometimes these are relevant too.
+
+
+bees Crashes
+------------
+
+ * If you have a core dump, run these commands in gdb and include
+ the output in your report (you may need to post it as a compressed
+ attachment, as it can be quite large):
+
+        (gdb) set pagination off
+        (gdb) info shared
+        (gdb) bt
+        (gdb) thread apply all bt
+        (gdb) thread apply all bt full
+
+  The last line generates megabytes of output and will often crash gdb.
+  This is OK, submit whatever output gdb can produce.
+
+  **Note that this output may include filenames or data from your
+  filesystem.**
+
+ * If the crash happens often (or you just want to be extra prepared),
+   you can run automate the gdb data collection with this wrapper script:
+
+<pre>
+#!/bin/sh
+set -x
+
+# Move aside old core files for analysis
+for x in core*; do
+	if [ -e "$x" ]; then
+		mv -vf "$x" "old-$x.$(date +%Y-%m-%d-%H-%M-%S)"
+	fi
+done
+
+# Delete old core files after a week
+find old-core* -type f -mtime +7 -exec rm -vf {} + &
+
+# Turn on the cores (FIXME: may need to change other system parameters
+# that capture or redirect core files)
+ulimit -c unlimited
+
+# Run the command
+"$@"
+rv="$?"
+
+# Don't clobber our core when gdb crashes
+ulimit -c 0
+
+# If there were core files, generate reports for them
+for x in core*; do
+	if [ -e "$x" ]; then 
+		gdb --core="$x" \
+		--eval-command='set pagination off' \
+		--eval-command='info shared' \
+		--eval-command='bt' \
+		--eval-command='thread apply all bt' \
+		--eval-command='thread apply all bt full' \
+		--eval-command='quit' \
+		--args "$@" 2>&1 | tee -a "$x.txt"
+	fi
+done
+
+# Return process exit status to caller
+exit "$rv"
+</pre>
+
+  To use the wrapper script, insert it just before the `bees` command,
+  as in:
+
+    gdb-wrapper bees /path/to/fs/
+
+
+Kernel crashes, corruption, and filesystem damage
+-------------------------------------------------
+
+bees doesn't do anything that _should_ cause corruption or data loss;
+however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
+with some Linux block device layers](btrfs-other.md), so corruption is
+not impossible.
+
+Issues with the btrfs filesystem kernel code or other block device layers
+should be reported to their respective maintainers.