mirror of
https://github.com/Zygo/bees.git
synced 2025-05-17 13:25:45 +02:00
Standard crash backtrace collection, plus $BEESSTATUS for the high-level overview of what bees is doing. Signed-off-by: Zygo Blaxell <bees@furryterror.org>
161 lines
6.2 KiB
Markdown
161 lines
6.2 KiB
Markdown
What to do when something goes wrong with bees
|
|
==============================================
|
|
|
|
Hangs and excessive slowness
|
|
----------------------------
|
|
|
|
### Are you using qgroups or autodefrag?
|
|
|
|
Read about [bad btrfs feature interactions](btrfs-other.md).
|
|
|
|
### Use load-throttling options
|
|
|
|
If bees is just more aggressive than you would like, consider using
|
|
[load throttling options](options.md). These are usually more effective
|
|
than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
|
|
certainly use those too).
|
|
|
|
### Check `$BEESSTATUS`
|
|
|
|
If bees or the filesystem seems to be stuck, check the contents of
|
|
`$BEESSTATUS`. bees describes what it is doing (and how long it has
|
|
been trying to do it) through this file.
|
|
|
|
Sample:
|
|
|
|
<pre>
|
|
THREADS (work queue 68 tasks):
|
|
tid 20939: crawl_5986: dedup BeesRangePair: 512K src[0x9933f000..0x993bf000] dst[0x9933f000..0x993bf000]
|
|
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
|
|
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
|
|
tid 20940: crawl_5986: dedup BeesRangePair: 512K src[0x992bf000..0x9933f000] dst[0x992bf000..0x9933f000]
|
|
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
|
|
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
|
|
tid 21177: crawl_5986: dedup BeesRangePair: 512K src[0x9923f000..0x992bf000] dst[0x9923f000..0x992bf000]
|
|
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
|
|
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
|
|
tid 21677: bees: [68493.1s] main
|
|
tid 21689: crawl_transid: [236.508s] waiting 332.575s for next 10 transid RateEstimator { count = 87179, raw = 969.066 / 32229.2, ratio = 969.066 / 32465.7, rate = 0.0298489, duration(1) = 33.5021, seconds_for(1) = 1 }
|
|
tid 21690: status: writing status to file '/run/bees.status'
|
|
tid 21691: crawl_writeback: [203.456s] idle, dirty
|
|
tid 21692: hash_writeback: [12.466s] flush rate limited after extent #17 of 64 extents
|
|
tid 21693: hash_prefetch: [2896.61s] idle 3600s
|
|
</pre>
|
|
|
|
The time in square brackets indicates how long the thread has been
|
|
executing the current task (if this time is below 5 seconds then it
|
|
is omitted). We can see here that the main thread (and therefore the
|
|
bees process as a whole) has been running for 68493.1 seconds, the
|
|
last hash table write was 12.5 seconds ago, and the last transid poll
|
|
was 236.5 seconds ago. Three worker threads are currently performing
|
|
dedupe on extents.
|
|
|
|
Thread names of note:
|
|
|
|
* `crawl_12345`: scan/dedupe worker threads (the number is the subvol
|
|
ID which the thread is currently working on). These threads appear
|
|
and disappear from the status dynamically according to the requirements
|
|
of the work queue and loadavg throttling.
|
|
* `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
|
|
* `crawl_master`: task that finds new extents in the filesystem and populates the work queue
|
|
* `crawl_transid`: btrfs transid (generation number) tracker and polling thread
|
|
* `status`: the thread that writes the status reports to `$BEESSTATUS`
|
|
* `crawl_writeback`: writes the scanner progress to `beescrawl.dat`
|
|
* `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
|
|
* `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly
|
|
|
|
### Dump kernel stacks of hung processes
|
|
|
|
Check the kernel stacks of all blocked kernel processes:
|
|
|
|
ps xar | while read -r x y; do ps "$x"; head -50 --verbose /proc/"$x"/task/*/stack; done | tee lockup-stacks.txt
|
|
|
|
Submit the above information in your bug report.
|
|
|
|
### Check dmesg for btrfs stack dumps
|
|
|
|
Sometimes these are relevant too.
|
|
|
|
|
|
bees Crashes
|
|
------------
|
|
|
|
* If you have a core dump, run these commands in gdb and include
|
|
the output in your report (you may need to post it as a compressed
|
|
attachment, as it can be quite large):
|
|
|
|
(gdb) set pagination off
|
|
(gdb) info shared
|
|
(gdb) bt
|
|
(gdb) thread apply all bt
|
|
(gdb) thread apply all bt full
|
|
|
|
The last line generates megabytes of output and will often crash gdb.
|
|
This is OK, submit whatever output gdb can produce.
|
|
|
|
**Note that this output may include filenames or data from your
|
|
filesystem.**
|
|
|
|
* If the crash happens often (or you just want to be extra prepared),
|
|
you can run automate the gdb data collection with this wrapper script:
|
|
|
|
<pre>
|
|
#!/bin/sh
|
|
set -x
|
|
|
|
# Move aside old core files for analysis
|
|
for x in core*; do
|
|
if [ -e "$x" ]; then
|
|
mv -vf "$x" "old-$x.$(date +%Y-%m-%d-%H-%M-%S)"
|
|
fi
|
|
done
|
|
|
|
# Delete old core files after a week
|
|
find old-core* -type f -mtime +7 -exec rm -vf {} + &
|
|
|
|
# Turn on the cores (FIXME: may need to change other system parameters
|
|
# that capture or redirect core files)
|
|
ulimit -c unlimited
|
|
|
|
# Run the command
|
|
"$@"
|
|
rv="$?"
|
|
|
|
# Don't clobber our core when gdb crashes
|
|
ulimit -c 0
|
|
|
|
# If there were core files, generate reports for them
|
|
for x in core*; do
|
|
if [ -e "$x" ]; then
|
|
gdb --core="$x" \
|
|
--eval-command='set pagination off' \
|
|
--eval-command='info shared' \
|
|
--eval-command='bt' \
|
|
--eval-command='thread apply all bt' \
|
|
--eval-command='thread apply all bt full' \
|
|
--eval-command='quit' \
|
|
--args "$@" 2>&1 | tee -a "$x.txt"
|
|
fi
|
|
done
|
|
|
|
# Return process exit status to caller
|
|
exit "$rv"
|
|
</pre>
|
|
|
|
To use the wrapper script, insert it just before the `bees` command,
|
|
as in:
|
|
|
|
gdb-wrapper bees /path/to/fs/
|
|
|
|
|
|
Kernel crashes, corruption, and filesystem damage
|
|
-------------------------------------------------
|
|
|
|
bees doesn't do anything that _should_ cause corruption or data loss;
|
|
however, [btrfs has kernel bugs](btrfs-kernel.md) and [interacts poorly
|
|
with some Linux block device layers](btrfs-other.md), so corruption is
|
|
not impossible.
|
|
|
|
Issues with the btrfs filesystem kernel code or other block device layers
|
|
should be reported to their respective maintainers.
|