GGLinnk/bees - bees - Virtual World Git

mirror of https://github.com/Zygo/bees.git synced 2026-01-06 02:40:21 +01:00

Author	SHA1	Message	Date
Zygo Blaxell	f6908420ad	hash: handle $BEESHOME on non-btrfs bees explicitly supports storing $BEESHOME on another filesystem, and does not require that filesystem to be btrfs; however, if $BEESHOME is on a non-btrfs filesystem, there is an exception on every startup when trying to identify the subvol root of the hash table file in order to blacklist it, because non-btrfs filesystems don't have subvol roots. Fix by checking not only whether $BEESHOME is on btrfs, but whether it is on the _same_ btrfs, as the bees root, without throwing an exception. The hash table is blacklisted only when both filesystems are btrfs and have the same fsid. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-02-06 22:42:15 -05:00
Zygo Blaxell	30cd375d03	readahead: clean up the code, update docs Remove dubious comments and #if 0 section. Document new event counters, and add one for read failures. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-02-06 22:42:15 -05:00
Zygo Blaxell	48b7fbda9c	progress: adjust minimum thresholds for ETA to 10 seconds and 1 GiB of data 1% is a lot of data on a petabyte filesystem, and a long time to wait for an ETA. After 1 GiB we should have some idea of how fast we're reading the data. Increase the time to 10 seconds to avoid a nonsense result just after a scan starts. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-02-06 22:42:15 -05:00
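A minimal sketch of the gating this commit describes, assuming both thresholds must be met before an ETA is shown. The function name, constants, and the linear extrapolation are hypothetical illustrations, not code from bees.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Thresholds taken from the commit message; the names are hypothetical.
constexpr double kMinSeconds = 10.0;
constexpr uint64_t kMinBytes = 1ULL << 30; // 1 GiB

// Returns the estimated seconds remaining, or nullopt while the sample is
// still too small to extrapolate from.
std::optional<double> estimate_eta(uint64_t bytes_done, uint64_t bytes_total,
                                   double elapsed_seconds)
{
    assert(bytes_done <= bytes_total);
    if (elapsed_seconds < kMinSeconds || bytes_done < kMinBytes) {
        return std::nullopt;
    }
    const double rate = static_cast<double>(bytes_done) / elapsed_seconds; // bytes/s
    return static_cast<double>(bytes_total - bytes_done) / rate;
}
```

Absolute thresholds behave the same on any filesystem size, whereas a percentage threshold scales with the filesystem.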
Zygo Blaxell	874832dc58	openat2: log a warning when we fall back to openat This should occur only once per run, but it's worth leaving a note that it has happened. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-19 22:19:42 -05:00
Zygo Blaxell	5fe89d85c3	extent scan: make sure we run every extent crawler once per transaction There's a pathological case where all of the extent scan crawlers except one are at the end of a crawl cycle, but the one crawler that is still running is keeping the Task queue full. The result is that bees never starts the other extent scan crawlers, because the queue is always full at the instant a new transid triggers the start of a new scan. That's bad because it will result in bees falling behind when new data from the inactive size tiers appears. To fix this, check for throttling _after_ creating at least one scan task in each crawler. That will keep the crawlers running, and possibly allow them to claw back some space in the Task queue. It slightly overcommits the Task queue, so there will be a few more Tasks than nominally allowed. Also (re)introduce some hysteresis in the queue size limit and reduce it a little, so that bees isn't continually stopping and restarting crawls every time one task is created or completed, and so that we stay under the configured Task limit despite overcommitting. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-19 22:19:42 -05:00
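The hysteresis part of this change can be sketched as a gate with two watermarks. This is only an illustration under assumed names (`QueueGate`, `low`, `high`); it does not cover the "create at least one task per crawler before checking" part, and the real limits in bees may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical gate: stop creating scan tasks once the queue reaches `high`,
// and resume only after it has drained down to `low`.
class QueueGate {
public:
    QueueGate(std::size_t low, std::size_t high) : m_low(low), m_high(high)
    {
        assert(low < high);
    }

    // Feed in the current queue size; returns true if new tasks may be created.
    bool update(std::size_t queue_size)
    {
        if (m_throttled) {
            if (queue_size <= m_low) m_throttled = false;
        } else {
            if (queue_size >= m_high) m_throttled = true;
        }
        return !m_throttled;
    }

private:
    std::size_t m_low;
    std::size_t m_high;
    bool m_throttled = false;
};
```

With a single threshold the gate would flip every time one task is created or completed; the gap between the two watermarks is what prevents that.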
Zygo Blaxell	a2b3e1e0c2	log: demote a lot of BEESLOGWARN to higher verbosity levels Toxic extent workarounds are going away because the underlying kernel bugs have been fixed. They are no longer worthy of spamming non-developer logs. INO_PATHS can return no paths if an inode has been deleted. It doesn't need a log message at all, much less one at WARN level. Dedupe failure can be INFO, the same level as dedupe itself, especially since the "NO dedupe" message doesn't mention what was [not] deduped. Inspired by Kai Krakow's "context: demote "abandoned toxic match" to debug log level". Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-19 01:08:28 -05:00
Kai Krakow	aaec931081	context: demote "abandoned toxic match" to debug log level This log message creates an overwhelming number of messages in the system journal, leading to write-back flushing storms under high activity. As it is a work-around message, it is probably only useful to developers, thus demote to debug level. This fixes latency spikes in desktop usage after adding a lot of new files, especially since systemd-journal starts to flush caches if it sees memory pressure. Signed-off-by: Kai Krakow <kai@kaishome.de>	2025-01-19 00:59:22 -05:00
Zygo Blaxell	d4a681c8a2	Revert "roots: use a non-idle task for next_transid" next_transid tasks don't respect queue selection very well, because they effectively end up spinning in a loop until all other worker threads become busy. Back this out, and fix the priority handling in the Task library. This reverts commit `58db4071de`.	2025-01-12 18:48:33 -05:00
Zygo Blaxell	b8dd9a2db0	progress: put a timestamp in the bottom row This records the time when the progress data was calculated, to help indicate when the data might be very old. While we're here, move "now" out of the loop so there's only one value. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-11 23:39:55 -05:00
Zygo Blaxell	2f2a68be3d	roots: use openat2 instead of openat when available This increases resistance to symlink and mount attacks. Previously, bees could follow a symlink or a mount point in a directory component of a subvol or file name. Once the file is opened, the open file descriptor would be checked to see if its subvol and inode matches the expected file in the target filesystem. Files that fail to match would be immediately closed. With openat2 resolve flags, symlinks and mount points terminate path resolution in the kernel. Paths that lead through symlinks or onto mount points cannot be opened at all. Fall back to openat() if openat2() returns ENOSYS, so bees will still run on kernels before v5.6. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-09 02:26:53 -05:00
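A rough sketch of the fallback pattern described here, for Linux only. The helper name `open_beneath` and the exact choice of resolve flags are assumptions for illustration; the commit does not list which `RESOLVE_*` flags bees passes.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <linux/openat2.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: try openat2() with resolve flags that make the kernel
// stop at symlinks and mount points, and fall back to plain openat() only
// when the kernel does not implement the syscall at all.
int open_beneath(int dirfd, const char *path, int flags)
{
    assert(path != nullptr);
    struct open_how how = {};
    how.flags = static_cast<__u64>(flags);
    how.resolve = RESOLVE_NO_SYMLINKS | RESOLVE_NO_XDEV; // assumed flag set
    const long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
    if (fd < 0 && errno == ENOSYS) {
        // Kernel older than v5.6: the caller still has to verify that the
        // opened file is the one it expected.
        return openat(dirfd, path, flags);
    }
    return static_cast<int>(fd);
}
```

After a fallback, the post-open subvol and inode check described in the commit message is still needed, because plain `openat()` will follow symlinks and cross mount points.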
Zygo Blaxell	82f1fd8054	process: replace crucible::gettid() with a weak symbol Since we're now using weak symbols for dodgy libc functions, we might as well do it for gettid() too. Use the ::gettid() global namespace and let libc override it. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-09 01:37:44 -05:00
Zygo Blaxell	613ddc3c71	progress: rename "ctime" -> "tm_left" "ctime", an abbreviation of "cycle time", collides with "ctime", an abbreviation of "st_ctime", a well-known filesystem term. "tm_left" fits in the column, so use that. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-06 12:50:50 -05:00
Zygo Blaxell	c3a39b7691	progress: rework the progress table after github discussion * Report position within cycle in units that cannot be mistaken for size or percentage * Put the total/maximum values in their own row * Add a start time column * Change column titles to reference "cycles" * Use "idle" instead of "finished" when a crawler is not running * Replace "transid" with "gen" because it's shorter Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:45:37 -05:00
Zygo Blaxell	58db4071de	roots: use a non-idle task for next_transid The scanners which finish early can become stuck behind scanners that are able to keep the queue full. Switch the next_transid task to the normal Task queues so that we force scanners to restart on every new transaction, possibly deferring already queued work to do so. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:36:53 -05:00
Zygo Blaxell	0d3e13cc5f	context: report time in scan_one_extent Add yet another field to the scan/skip report line: the wallclock time used to process the extent ref. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:36:53 -05:00
Zygo Blaxell	1af5fcdf34	roots: don't access a shared variable after releasing a lock Access the local copy of `m_root_crawl_map` instead. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:36:53 -05:00
Zygo Blaxell	87472b6086	extent scan: don't put non-data block groups in the data extent map The total data size should not include metadata or system block groups, and already does not; however, we still have these block groups in the map for mapping the crawl pointer to a logical offset within the filesystem. Rearrange a few lines around the `if` statement so that the map doesn't contain anything it should not. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:32:48 -05:00
Zygo Blaxell	ca351d389f	extent scan: pick the right block groups for mixed-bg filesystems The progress indicator was failing on a mixed-bg filesystem because those filesystems have block groups which have both _DATA and _METADATA bits, and the filesystem size calculation was excluding block groups that have _METADATA set. It should exclude block groups that have _DATA not set. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:15:37 -05:00
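The difference between the two tests is easiest to see on the flag bits. The sketch below uses the block group type values from the kernel's btrfs headers under local names; the predicate names are hypothetical, and the "old" predicate is reconstructed only from the description in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Block group type bits, with the same values as the kernel's btrfs headers.
constexpr uint64_t BLOCK_GROUP_DATA     = 1ULL << 0;
constexpr uint64_t BLOCK_GROUP_SYSTEM   = 1ULL << 1;
constexpr uint64_t BLOCK_GROUP_METADATA = 1ULL << 2;

// Old behaviour as described: anything with _METADATA set is excluded,
// which wrongly drops mixed block groups.
constexpr bool counted_before_fix(uint64_t flags)
{
    return (flags & BLOCK_GROUP_METADATA) == 0;
}

// Fixed behaviour: a block group counts if and only if _DATA is set.
constexpr bool is_data_block_group(uint64_t flags)
{
    return (flags & BLOCK_GROUP_DATA) != 0;
}

// A mixed block group carries both bits.
static_assert(is_data_block_group(BLOCK_GROUP_DATA | BLOCK_GROUP_METADATA),
              "mixed block groups hold data");
```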
Zygo Blaxell	1f0b8c623c	options: improve message when too many--or too few--path arguments given Running bees with no arguments complains about "Only one" path argument. Replace this with "Exactly one" which uses similar terminology to other btrfs tools. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:15:37 -05:00
Zygo Blaxell	74296c644a	options: return EXIT_SUCCESS after displaying help message `getopt_long` already supplies a message when an option cannot be parsed, so there isn't a need to distinguish option parse failures from help requests. Fixes: https://github.com/Zygo/bees/pull/277 Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:15:37 -05:00
Zygo Blaxell	231593bfbc	throttle: don't hold the multilock during throttle Release the lock before entering the throttle sleep, so that other threads can still run. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:15:37 -05:00
Zygo Blaxell	81bbf7e1d4	throttle: set default to 0.0 Longer latency testing runs are not showing a consistent gain from a throttle factor of 1.0. Make the default more conservative. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:15:37 -05:00
Zygo Blaxell	2a1ed0b455	throttle: track time values more closely Decaying averages by 10% every 5 minutes gives roughly a half-hour half-life to the rolling average. Speed that up to once per minute. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:14:31 -05:00
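The half-life figure can be checked with a short calculation. The helper below is a hypothetical illustration, and it assumes the 10% decay factor is unchanged and only the interval moves from 5 minutes to 1 minute.

```cpp
#include <cassert>
#include <cmath>

// Minutes until a value halves when it is multiplied by `keep` once every
// `interval_minutes`. Solves keep^n = 0.5 for n, then scales by the interval.
double half_life_minutes(double keep, double interval_minutes)
{
    assert(keep > 0.0 && keep < 1.0);
    return interval_minutes * std::log(0.5) / std::log(keep);
}
```

`half_life_minutes(0.9, 5.0)` comes out just under 33 minutes, matching the "roughly a half-hour" in the message; under the stated assumption, `half_life_minutes(0.9, 1.0)` is a little over 6.5 minutes.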
Zygo Blaxell	d160edc15a	throttle: add --throttle-factor option to control throttling factor Also change the initializer syntax for the option list to use C99 compound literals. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-01-03 23:13:51 -05:00
Zygo Blaxell	e79b242ce2	options: clean up the parser, prepare for new options with no short form We're not adding any more short options, but the debugging code doesn't work with optvals above 255. Also clean up constness and variable lifetimes. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-16 23:32:18 -05:00
Zygo Blaxell	ea45982293	throttle: add delays to match deferred request rate to btrfs completion rate Measure the time spent running various operations that extend btrfs transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe) and arrange for each operation to run for not less than the average amount of time by adding a sleep after each operation that takes less than the average. The delay after each operation is intended to slow down the rate of deferred and long-running requests from bees to match the rate at which btrfs is actually completing them. This may help avoid big spikes in latency if btrfs has so many requests queued that it has to force a commit to release memory. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-16 23:32:18 -05:00
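A minimal sketch of the padding idea, assuming a plain running mean and a multiplicative factor. The class and method names are hypothetical, and bees itself uses a decaying average rather than the simple mean shown here.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical throttle: pads each operation so it takes at least
// factor * (average duration of the operations seen so far).
class Throttle {
public:
    explicit Throttle(double factor) : m_factor(factor) { assert(factor >= 0.0); }

    // Record one operation's duration; returns the extra seconds to sleep.
    double delay_for(double op_seconds)
    {
        m_total += op_seconds;
        ++m_count;
        const double target = m_factor * (m_total / m_count);
        return op_seconds < target ? target - op_seconds : 0.0;
    }

    // Time `op`, then sleep for whatever delay_for() asks for.
    template <class Op>
    void run(Op op)
    {
        const auto start = std::chrono::steady_clock::now();
        op();
        const std::chrono::duration<double> took =
            std::chrono::steady_clock::now() - start;
        std::this_thread::sleep_for(
            std::chrono::duration<double>(delay_for(took.count())));
    }

private:
    double m_factor;
    double m_total = 0.0;
    unsigned long m_count = 0;
};
```

In this sketch a factor of 0.0 disables the delay entirely, and operations that already take longer than the average are never padded.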
Zygo Blaxell	f209cafcd8	bees: bump the file limits again, 512k files and 64k dirs Test machines keep blowing past the 32k file limit. 16 worker threads at 10,000 files each is much larger than 32k. Other high-FD-count services like DNS servers ask for million-file rlimits. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-16 22:54:12 -05:00
Zygo Blaxell	c4b31bdd5c	extent scan: no need for "No ref for extent" debug message While a snapshot is being deleted, there will be a continuous stream of "No ref for extent" messages. This is a common event that does not need to be reported. There is an analogous situation when a call to open() fails with ENOENT. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-14 15:02:39 -05:00
Zygo Blaxell	08fe145988	context: wait for btrfs send to finish, then try dedupe again Dedupe is not possible on a subvol where a btrfs send is running: BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress) btrfs informs a process with EAGAIN that a dedupe could not be performed due to a running send operation. It would be possible to save the crawler state at the affected point, fork a new crawler that avoids the subvol under send, and resume the crawler state after a successful dedupe is detected; however, this only helps the intersection of the set of users who have unrelated subvols that don't share extents, and the set of users who cannot simply delay dedupe until send is finished. The simplest approach is to simply stop and wait until the send goes away. The simplest approach is taken here. When a dedupe fails with EAGAIN, affected Tasks will poll, approximately once per transaction, until the dedupe succeeds or fails with a different error. bees dedupe performance corresponds with the availability of subvols that can accept dedupe requests. While the dedupe is paused, no new Tasks can be performed by the worker thread. If subvols are small and isolated from the bulk of the filesystem data, the result will be a small but partial loss of dedupe performance during the send as some worker threads get stuck on the sending subvol. If subvols heavily share extents with duplicate data in other subvols, worker threads will all become blocked, and the entire bees process will pause until at least some of the running sends terminate. During the polling for btrfs send, the dedupe Task will hold its dst file open. This open FD won't interfere with snapshot or file delete because send subvols are always read-only (it is not possible to delete a file on a RO subvol, open or otherwise) and send itself holds the affected subvol open, preventing its deletion. Once the send terminates, the dedupe will terminate soon after, and the normal FD release can occur. This pausing during btrfs send is unrelated to the `--workaround-btrfs-send` option, although `--workaround-btrfs-send` will cause the pausing to trigger less often. It applies to all scan modes. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-14 14:51:28 -05:00
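The polling behaviour described in this commit can be sketched as a retry loop. The helper name and the fixed poll interval are assumptions for illustration; the message says bees polls approximately once per transaction rather than on a fixed timer.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: keep calling `attempt` until it stops failing with
// EAGAIN. `attempt` returns 0 on success or an errno value on failure.
int retry_while_eagain(const std::function<int()> &attempt,
                       std::chrono::milliseconds poll_interval)
{
    assert(attempt);
    for (;;) {
        const int err = attempt();
        if (err != EAGAIN) {
            return err; // success, or a different error to report upward
        }
        std::this_thread::sleep_for(poll_interval);
    }
}
```

The calling thread is blocked for the whole loop, which mirrors the trade-off stated in the message: a worker stuck on a sending subvol cannot pick up other Tasks.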
Zygo Blaxell	bb09b1ab0e	roots: drop method `transid_re` There are no callers of this method any more, and it exposes more of BeesRoots than we really want things to have access to. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-13 23:19:43 -05:00
Zygo Blaxell	94d9945d04	roots: move the transid cache update into transid_max_nocache() All callers of the `transid_max_nocache` method update `m_transid_re` with the return value, so do that in `transid_max_nocache` itself. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-13 23:19:43 -05:00
Zygo Blaxell	b9abcceacb	progress: move the "finished" tag to a column where it won't obscure data The "done" pointer and the "%done" fields are still useful because they indicate _actual_ progress, not the work that has been _promised_. So it is possible for a crawl to be "finished" (all extents queued) but not "100.0000%" (some of those extents still active or in the queue). "deferred" state isn't particularly useful, so drop it. "finished" state implies no ETA, so that column is unused. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-12 23:10:15 -05:00
Zygo Blaxell	31f3a8d67d	progress: relabel the inaccurate ETA column ETA is calculated using a sample obtained by snooping on bees's normal crawling operations. This sample is heavily biased and not representative of the entire filesystem. If the distribution of extent sizes in the filesystem is not uniform, the ETA can be wildly wrong. Collecting an accurate sample set would require extra IO and CPU time which should be spent doing dedupes instead. Explicitly label the ETA as inaccurate to avoid having too many users report the same bug. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-12 23:10:15 -05:00
Zygo Blaxell	0580c10082	main: add support for pause (SIGUSR1) and resume (SIGUSR2) These are simple on/off switches for the task queue. They are lightweight requests for bees to be paused temporarily, but allow bees to release open files and save progress while paused. These signals are an alternative to SIGSTOP and SIGCONT, or using the cgroup freezer's FROZEN and THAWED states, which pause and resume the bees process, but do not allow the bees process to release open files or save progress. Snapshot and file deletes can occur on the filesystem while bees is paused by SIGUSR1 but not by SIGSTOP. These signals are also an alternative to SIGTERM and restart, which flush out the whole hash table and progress state on exit, and read the whole table back into memory on restart. This feature is experimental and may be replaced by a more general configuration or runtime control mechanism in the future. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-12 23:01:19 -05:00
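One way to build such an on/off switch is a pair of signal handlers that flip an atomic flag, sketched below. The names are hypothetical and bees may wire its signals up differently; the sketch only shows why the process can keep running (and release files) while paused.

```cpp
#include <atomic>
#include <cassert>
#include <signal.h>

// Set by SIGUSR1, cleared by SIGUSR2. A task queue would check this flag
// before starting each new task.
static std::atomic<bool> g_paused{false};

extern "C" void on_pause_signal(int signo)
{
    // Only an atomic store: nothing here allocates or takes a lock.
    g_paused.store(signo == SIGUSR1);
}

// Installs the same handler for both signals; returns false on failure.
bool install_pause_handlers()
{
    struct sigaction sa = {};
    sa.sa_handler = on_pause_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, nullptr) == 0
        && sigaction(SIGUSR2, &sa, nullptr) == 0;
}
```

Unlike SIGSTOP, the process stays scheduled, so threads that are not running tasks can still close file descriptors and write out progress.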
Zygo Blaxell	e40339856f	readahead: use the right parameter order when checking the range In some cases the offset and size arguments were flipped when checking to see if a range had already been read. This would have been OK as long as the same mistake had been made consistently, since `bees_readahead_check` only does a cache lookup on the parameters, it doesn't try to use them to read a file. Alas, there was one case where the correct order was used, albeit a relatively rare one. Fix all the calls to use the correct order. Also fix a comment: the recent request cache is global to all threads. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-04 11:17:44 -05:00
Zygo Blaxell	3e89fe34ed	roots: avoid copying a BtrfsIoctlSearchKey Although all the members of BtrfsExtentDataFetcher are theoretically copiable, there's no need to actually make any such copy. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-03 16:54:14 -05:00
Zygo Blaxell	dc74766179	context: spell "progress" correctly Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-02 09:50:28 -05:00
Zygo Blaxell	3a33a5386b	context: add a PROGRESS: header in $BEESSTATUS Make it clearer where the progress information goes. Also add placeholder text so the progress section isn't empty at startup, when the progress hasn't been calculated yet. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 11:41:59 -05:00
Zygo Blaxell	7a197e2f33	bees: post-kernel-5.7 toxic extent handling Toxic extents are mostly gone in kernel 5.7 and later. Increase the timeout for toxic extent handling to reduce false positives, and remove persistently stored toxic hashes from the hash table. Toxic hashes are still stored nonpersistently to help mitigate problems due to any remaining kernel bugs. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:52 -05:00
Zygo Blaxell	43d38ca536	extent scan: don't serialize dedupe and LOGICAL_INO when using extent scan mode The serialization doesn't seem to be necessary for the extent scan mode. No infinite loops in the kernel have been observed in the past two years, despite never having used MultiLock for the extent scanner. Leave the serialization for now on the subvol scanners. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:52 -05:00
Zygo Blaxell	8d4d153d1d	main: set default scan mode to mode 4 (EXTENT) Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	c1af219246	progress: squeeze the progress table into 80 columns or less We don't need the subvol numbers since they're only interesting to developers. We don't need both max and min sizes, pick one and drop the other. Replace "16E" with "max"--it is the same number of characters, but doesn't require the user to know what 1<<64 is off the top of their head. Shorten "remain" to "todo" because sometimes those extra two columns matter. Drop the seconds field in ETA timestamps. Long scan arrival times are years away, and short scan arrival times are only updated once every 5 minutes, so the extra precision isn't useful. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	9c183c2c22	progress: put the progress table in the stats and status files Make the progress information more accessible, without having to enable full debug log and fish it out of the stream with grep. Also increase the progress log level to INFO. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	59f8a467c3	extent scan: fix crawl_map creation There are two crawl_maps in extent scan's next_transid: one gets initialized, the other gets used. This works OK as long as bees is resuming an existing scan, because the two maps are identical; however, it fails if bees is starting without an existing set of crawl data, and one of the two maps is empty or partially filled. The failure is intermittent, as the crawl map is being populated at the same time next_transid runs. It will eventually be completed after several transaction cycles, at which point bees runs normally. It does add significant delays during startup for benchmarks. There's only one crawl_map in extent scan, it always has the same crawlers, and extent scan's `next_transid` creates it by itself. Ignore the map from BeesRoots/BeesCrawl. Also throw in some missing but helpful trace statements. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	9987aa8583	progress: estimate actual data sizes for progress report Replace pointers in the "done" and "total" columns with estimated data sizes for each size tier. The estimation is based on statistics collected from extents scanned during the current bees run. Move the total size for the entire filesystem up to the heading. Report the _completed_ position (i.e. the one that would be saved in `beescrawl.dat`), not the _queued_ position (i.e. the one where the next Task would be created in memory). At the end of the data, the crawl pointer ends up at some random point in the filesystem just after the newest extent, so the progress gets to 99.7% and then goes to some random value like 47% or 3%, not to 100%. Report "deferred" in the "done" column when the crawler is waiting for the next transid, and "finished" in the "%done" column when the crawler has reached the end of the data. Suppress the ETA when finished. This makes it clear that there's no further work to do for these crawlers. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	8080abac97	extent scan: refactor BeesScanMode so derived classes decide their own scan scheduling BeesScanModeExtent uses six scan Tasks instead of one, which leads to awkwardness like the do_scan method to tell crawl_roots how to do what it shouldn't need to know how to do anyway. Move the crawl_roots logic into the ::scan methods themselves. This also deletes the very popular "crawl_more ran out of data" message. Extent scan explicitly indicates when a scan is complete, so there's no longer a need to fish this message out of the log. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	1e139d0ccc	extent scan: put all the refs in a single Task, sort them, use idle task The sorting avoids problematic read orders, like extent refs in the same inode with descending offsets, that btrfs is not optimized for. Putting everything in one Task keeps the queue sizes small, and manages the lock contention much more calmly. We only want to be mapping extent refs if there's not enough extents already in the queue to keep worker threads busy, so use the `idle()` method instead of `run()`. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
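The sorting step might look like the sketch below. The `ExtentRef` struct and the (root, inode, offset) key order are assumptions for illustration; the commit only states that refs are sorted to avoid orders such as descending offsets within one inode.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical extent reference: one place where an extent is used.
struct ExtentRef {
    uint64_t root;   // subvol id
    uint64_t inode;
    uint64_t offset; // byte offset within the file
};

// Orders refs by subvol, then inode, then ascending file offset.
bool ref_less(const ExtentRef &a, const ExtentRef &b)
{
    return std::tie(a.root, a.inode, a.offset)
         < std::tie(b.root, b.inode, b.offset);
}

// Sort in place so each inode's refs are visited front to back.
void sort_refs(std::vector<ExtentRef> &refs)
{
    std::sort(refs.begin(), refs.end(), ref_less);
    assert(std::is_sorted(refs.begin(), refs.end(), ref_less));
}
```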
Zygo Blaxell	6542917ffa	extent scan: introduce SCAN_MODE_EXTENT The EXTENT scan mode reads the extent tree, splits it into tiers by extent size, converts each tier's extents into subvol/inode/offset refs, then runs the legacy bees dedupe engine on the refs. The extent scan mode can cheaply compute completion percentage and ETA, so do that every time a new transid is observed. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
Zygo Blaxell	44810d6df8	scan_one_extent: remove the unreadahead after benchmark results That unreadahead used to result in a 10% hit on benchmarks. Now it's closer to 75%. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
Zygo Blaxell	8f92b1dacc	BeesRangePair: drop the _really_ expensive toxic extent workaround We were doing a `LOGICAL_INO` ioctl on every _block_ of a matching extent, just to see how long it takes. It takes a while! This could be modified to do an ioctl with the `IGNORE_OFFSET` flag, once per new extent, but the kernel bug was fixed a long time ago, so we can start removing all the toxic extent code. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00

341 Commits