mirror of
https://github.com/Zygo/bees.git
synced 2025-05-17 13:25:45 +02:00
docs: add "what to do when something goes wrong" page
Standard crash backtrace collection, plus $BEESSTATUS for the high-level overview of what bees is doing. Signed-off-by: Zygo Blaxell <bees@furryterror.org>
This commit is contained in:
parent
7bba096077
commit
14b35e3426
@ -49,6 +49,7 @@ Recommended Reading
* [bees Gotchas](docs/gotchas.md)
* [btrfs kernel bugs](docs/btrfs-kernel.md)
* [bees vs. other btrfs features](docs/btrfs-other.md)
* [What to do when something goes wrong](docs/wrong.md)

More Information
----------------

@ -49,6 +49,7 @@ Recommended Reading
* [bees Gotchas](gotchas.md)
* [btrfs kernel bugs](btrfs-kernel.md)
* [bees vs. other btrfs features](btrfs-other.md)
* [What to do when something goes wrong](wrong.md)

More Information
----------------

160
docs/wrong.md
Normal file
@ -0,0 +1,160 @@
What to do when something goes wrong with bees
==============================================

Hangs and excessive slowness
----------------------------

### Are you using qgroups or autodefrag?

Read about [bad btrfs feature interactions](btrfs-other.md).

### Use load-throttling options

If bees is just more aggressive than you would like, consider using
[load throttling options](options.md). These are usually more effective
than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
certainly use those too).

### Check `$BEESSTATUS`

If bees or the filesystem seems to be stuck, check the contents of
`$BEESSTATUS`. bees describes what it is doing (and how long it has
been trying to do it) through this file.

Sample:

<pre>
THREADS (work queue 68 tasks):
tid 20939: crawl_5986: dedup BeesRangePair: 512K src[0x9933f000..0x993bf000] dst[0x9933f000..0x993bf000]
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
tid 20940: crawl_5986: dedup BeesRangePair: 512K src[0x992bf000..0x9933f000] dst[0x992bf000..0x9933f000]
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
tid 21177: crawl_5986: dedup BeesRangePair: 512K src[0x9923f000..0x992bf000] dst[0x9923f000..0x992bf000]
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
tid 21677: bees: [68493.1s] main
tid 21689: crawl_transid: [236.508s] waiting 332.575s for next 10 transid RateEstimator { count = 87179, raw = 969.066 / 32229.2, ratio = 969.066 / 32465.7, rate = 0.0298489, duration(1) = 33.5021, seconds_for(1) = 1 }
tid 21690: status: writing status to file '/run/bees.status'
tid 21691: crawl_writeback: [203.456s] idle, dirty
tid 21692: hash_writeback: [12.466s] flush rate limited after extent #17 of 64 extents
tid 21693: hash_prefetch: [2896.61s] idle 3600s
</pre>

The time in square brackets indicates how long the thread has been
executing the current task (if this time is below 5 seconds then it
is omitted). We can see here that the main thread (and therefore the
bees process as a whole) has been running for 68493.1 seconds, the
last hash table write was 12.5 seconds ago, and the last transid poll
was 236.5 seconds ago. Three worker threads are currently performing
dedupe on extents.

Thread names of note:

* `crawl_12345`: scan/dedupe worker threads (the number is the subvol
  ID which the thread is currently working on). These threads appear
  and disappear from the status dynamically according to the requirements
  of the work queue and loadavg throttling.
* `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
* `crawl_master`: task that finds new extents in the filesystem and populates the work queue
* `crawl_transid`: btrfs transid (generation number) tracker and polling thread
* `status`: the thread that writes the status reports to `$BEESSTATUS`
* `crawl_writeback`: writes the scanner progress to `beescrawl.dat`
* `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
* `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly
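
When the status file is long, it can help to filter it down to the threads that have been on one task for a while. The following is a minimal sketch, not part of bees: the function name `slow_threads` is hypothetical, and it assumes the status lines look like the sample above (a `tid` line with the elapsed time in square brackets, such as `[236.508s]`).

```sh
# slow_threads MIN_SECONDS
# Reads a bees status report on stdin and prints only the "tid" lines
# whose bracketed elapsed time is at least MIN_SECONDS.
# (Hypothetical helper; assumes the line format shown in the sample.)
slow_threads() {
    awk -v min="$1" '
        # Only consider thread lines that carry a "[<seconds>s]" field
        /^[[:space:]]*tid / && match($0, /\[[0-9.]+s\]/) {
            # Strip the leading "[" and the trailing "s]"
            secs = substr($0, RSTART + 1, RLENGTH - 3) + 0
            if (secs >= min) print
        }
    '
}
```

For example, `slow_threads 60 < "$BEESSTATUS"` would list threads that have spent a minute or more on their current task. Long times are normal for idle threads such as `bees` or `hash_prefetch`, so read the result alongside the task description on each line.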

### Dump kernel stacks of hung processes

Check the kernel stacks of all blocked kernel processes:

    ps xar | while read -r x y; do ps "$x"; head -50 --verbose /proc/"$x"/task/*/stack; done | tee lockup-stacks.txt

Submit the above information in your bug report.
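
If you only want processes in uninterruptible sleep (state `D`), a variation on the command above can be sketched as follows. The helper name `blocked_pids` is hypothetical; it assumes `ps` output with the PID in the first column and the state in the second, as produced by `ps -e -o pid= -o stat= -o comm=`. Reading `/proc/PID/task/*/stack` typically requires root.

```sh
# blocked_pids
# Reads "PID STAT COMMAND" lines on stdin and prints the PID of every
# process whose state starts with "D" (uninterruptible sleep).
# (Hypothetical helper, shown as a sketch.)
blocked_pids() {
    awk '$2 ~ /^D/ { print $1 }'
}

# Possible usage (run as root), left commented out so that sourcing
# this file has no side effects:
#   ps -e -o pid= -o stat= -o comm= | blocked_pids |
#       while read -r pid; do
#           head -50 --verbose /proc/"$pid"/task/*/stack
#       done | tee lockup-stacks.txt
```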

### Check dmesg for btrfs stack dumps

Sometimes these are relevant too.
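
One way to pull the btrfs-related parts out of a long kernel log is sketched below. The helper name `btrfs_log_context` and the default of 30 lines of trailing context are assumptions made for illustration; the match is a plain case-insensitive search for `btrfs`, so it may miss stack dumps that do not mention btrfs on a nearby line.

```sh
# btrfs_log_context [LINES]
# Reads kernel log text on stdin and prints every line mentioning
# "btrfs" (case-insensitive) plus LINES lines of following context
# (default 30), so that stack dumps after a btrfs message are kept.
# (Hypothetical helper; uses grep -A, available in GNU and BSD grep.)
btrfs_log_context() {
    grep -i -A "${1:-30}" 'btrfs'
}
```

For example, `dmesg | btrfs_log_context 40 | tee dmesg-btrfs.txt` saves the matching messages to a file that can be attached to a bug report.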


bees Crashes
------------

* If you have a core dump, run these commands in gdb and include
  the output in your report (you may need to post it as a compressed
  attachment, as it can be quite large):

        (gdb) set pagination off
        (gdb) info shared
        (gdb) bt
        (gdb) thread apply all bt
        (gdb) thread apply all bt full

    The last command generates megabytes of output and will often crash gdb.
    This is OK; submit whatever output gdb can produce.

    **Note that this output may include filenames or data from your
    filesystem.**

* If the crash happens often (or you just want to be extra prepared),
  you can automate the gdb data collection with this wrapper script:

<pre>
#!/bin/sh
set -x

# Move aside old core files for analysis
for x in core*; do
    if [ -e "$x" ]; then
        mv -vf "$x" "old-$x.$(date +%Y-%m-%d-%H-%M-%S)"
    fi
done

# Delete old core files after a week
find old-core* -type f -mtime +7 -exec rm -vf {} + &

# Turn on the cores (FIXME: may need to change other system parameters
# that capture or redirect core files)
ulimit -c unlimited

# Run the command
"$@"
rv="$?"

# Don't clobber our core when gdb crashes
ulimit -c 0

# If there were core files, generate reports for them
for x in core*; do
    if [ -e "$x" ]; then
        gdb --core="$x" \
            --eval-command='set pagination off' \
            --eval-command='info shared' \
            --eval-command='bt' \
            --eval-command='thread apply all bt' \
            --eval-command='thread apply all bt full' \
            --eval-command='quit' \
            --args "$@" 2>&1 | tee -a "$x.txt"
    fi
done

# Return process exit status to caller
exit "$rv"
</pre>

To use the wrapper script, insert it just before the `bees` command,
as in:

    gdb-wrapper bees /path/to/fs/
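
The wrapper script only finds core files named `core*` in its working directory, which is what the FIXME comment about other system parameters refers to. On Linux, where core dumps go is controlled by `/proc/sys/kernel/core_pattern`. The sketch below classifies a pattern value so you can tell whether the wrapper will see the core files; the function name `core_pattern_kind` and its three labels are hypothetical.

```sh
# core_pattern_kind PATTERN
# Classifies a Linux core_pattern value:
#   pipe     - starts with "|": cores are piped to a handler program
#              and will not appear as files in the working directory
#   absolute - starts with "/": cores are written to a fixed location
#   cwd      - anything else: cores are written relative to the
#              crashing process's working directory
# (Hypothetical helper, shown as a sketch.)
core_pattern_kind() {
    case "$1" in
        '|'*) echo pipe ;;
        /*)   echo absolute ;;
        *)    echo cwd ;;
    esac
}
```

For example, `core_pattern_kind "$(cat /proc/sys/kernel/core_pattern)"` prints `cwd` when the pattern is a relative name such as `core`. The wrapper can only find the file if that name also begins with `core`. For `pipe` or `absolute`, the core files have to be retrieved from wherever the handler or pattern puts them.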


Kernel crashes, corruption, and filesystem damage
-------------------------------------------------

bees doesn't do anything that _should_ cause corruption or data loss;
however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
with some Linux block device layers](btrfs-other.md), so corruption is
not impossible.

Issues with the btrfs filesystem kernel code or other block device layers
should be reported to their respective maintainers.