GGLinnk/bees - bees - Virtual World Git

mirror of https://github.com/Zygo/bees.git synced 2025-07-05 10:02:27 +02:00

Author	SHA1	Message	Date
Zygo Blaxell	0953160584	trace: export `exception_check` We need to call this from more than one place in bees. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-06-18 21:17:48 -04:00
Zygo Blaxell	51b3bcdbe4	trace: deprecate BEESLOGTRACE, align trace logs with exception notices Exceptions were logged at level NOTICE while the stack traces were logged at level DEBUG. That produced useless noise in the output with `-v5` or `-v6`, where there were exception headings logged, but no details. Fix that by placing the exceptions and traces at level DEBUG, but prefix them with `TRACE:` for easy grepping. Most of the events associated with BEESLOGTRACE either never happen, or they are harmless (e.g. trying to open deleted files or subvols). Reassign them to ordinary BEESLOGDEBUG, with one exception for unrecognized Extent flags that should be debugged if any appear. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-02-13 23:59:42 -05:00
Zygo Blaxell	ae58401d53	trace: avoid one copy in every trace function While investigating https://github.com/Zygo/bees/issues/282 I noticed that we're doing at least one unnecessary extra copy of the functor in BEESTRACE. Get rid of it with a const reference. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-02-13 23:59:42 -05:00
Zygo Blaxell	3e7eb43b51	BeesStringFile: figure out when to call--or _not_ call--fsync Older kernel versions featured some bugs in btrfs `fsync`, which could leave behind "ghost dirents", orphan filename items that did not have a corresponding inode. These dirents were created during log replay during the first mount after a crash due to several different bugs in the log tree and its use over the years. The last known bug of this kind was fixed in kernel 5.16. As of this writing, no fixes for this bug have been backported to any earlier LTS kernel. Some filesystems, including btrfs, will flush the contents of a new file before renaming it over an old file. On paper, btrfs can do this very cheaply since the contents of the new file are not referenced, and the old file not dereferenced, until a tree commit which includes both actions atomically; however, in real life, btrfs provides `fsync`-like semantics and uses the log-tree infrastructure to implement them, which compromises performance and acts as a magnet for bugs. The benefit of this trade-off is that `rename` can be used as a synchronization point for data outside of the btrfs, which would not happen if everything `rename` does was simply deferred to the next tree commit. The cost of this trade-off is that for the first 8 years of its existence, bees would trigger the bug so often that the project recommended its users put $BEESHOME in its own subvol to make it easy to remove ghost dirents left behind by the bug. Some other filesystems, such as xfs, don't have any special semantics for `rename`, and require `fsync` to avoid garbage or missing data after a crash. Even filesystems which do have a special case for `rename` can be configured to turn it off. btrfs will silently delete data from files in the event that an unrecoverable data block write error occurs. 
Kernel version 6.2 adds important new and unexpected cases where this can happen on filesystems using raid56 data, but it also happens in all usable btrfs versions (the silent deletion behavior was introduced in kernel version 3.9). Unrecoverable write errors are currently reported to userspace only through `fsync`. Since the failed extents are deleted, they cannot be detected via csum failures or scrub after the fact--and it's too late by then, the data is already gone. `fsync` is the last opportunity to detect the write failure before the `rename`. If the error is not detected, the contents of the file will be silently discarded in btrfs. The impact on bees is that scans will abruptly restart from zero after a crash combined with some other reasonably common failures. Putting all of this together leads to a rather complex workaround: if the filesystem under $BEESHOME (specifically, the filesystem where BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs filesystem, and the host kernel is a version prior to 5.16, then don't call `fsync` before `rename`. In all other cases, do call `fsync`, and prevent dependent writes (i.e. the following `rename`) in the event of errors. Since present kernel versions still require `fsync`, we don't need an upper bound on the kernel version check until someone fixes btrfs `rename` (or perhaps adds a flag to `renameat2` which prevents use of the log tree) in the kernel. Once that fix happens, we can drop the `fsync` call for kernels after that fixed version. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-02-10 21:04:20 -05:00
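The rule described in commit `3e7eb43b51` reduces to a small decision function. The following is a minimal C++ sketch of that rule only, not the code from the commit: `KernelVersionSketch` and `should_fsync_before_rename` are hypothetical names, and detecting the filesystem type and the running kernel version is assumed to happen elsewhere.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical helper type: a kernel version reduced to two numbers.
struct KernelVersionSketch {
    int major_num;
    int minor_num;
};

// True if version a is strictly older than version b.
inline bool older_than(const KernelVersionSketch &a, const KernelVersionSketch &b) {
    return std::tie(a.major_num, a.minor_num) < std::tie(b.major_num, b.minor_num);
}

// Sketch of the rule in the commit message:
//   btrfs under $BEESHOME and kernel older than 5.16 -> do not fsync
//   every other combination                          -> fsync, and skip the
//                                                       rename if fsync fails
inline bool should_fsync_before_rename(bool home_is_btrfs, KernelVersionSketch running) {
    assert(running.major_num >= 0 && running.minor_num >= 0);
    const KernelVersionSketch fixed{5, 16};
    return !(home_is_btrfs && older_than(running, fixed));
}
```

Under these assumptions, a caller would write the temporary file, call `fsync` only when the function returns true, and proceed to `rename` only if that `fsync` did not report an error.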
Zygo Blaxell	183b6a5361	extent scan: refactor BeesCrawl, BeesScanMode* The main gains here are: * Move extent tree searches into BeesScanModeExtent so that they are not slowed down by the BeesCrawl code, which was designed for the much more specialized metadata in subvol trees. * Enable short extent skipping now that BeesCrawl is out of the way. * Stop enumerating btrfs subvols when in extent scan mode. All this gets rid of >99% of unnecessary extent tree searches. Incremental extent scan cycles now finish in milliseconds instead of minutes. BeesCrawl was never designed to cope with the structure and content of the extent tree. It would waste thousands of tree-search ioctl calls reading and ignoring metadata items. Performance was particularly bad when a binary search was involved, as any binary search probe that landed in a metadata block group would read and discard all the metadata items in the block group, sequentially, repeated for each level of the binary search. This was blocking implementation of short extent skipping optimization for large extent size tiers, because the skips were using thousands of tree searches to skip over only a few hundred extent items. Extent scan also had to read every extent item twice to do the transid filtering, because BeesCrawl's interface discarded the relevant information when it converted a `BtrfsTreeItem` into a `BeesFileRange`. The cost of this extra fetch was negligible, but it could have been zero. Fix this by: * Copy the equivalent of `fetch_extents` from BeesCrawl into `BeesScanModeExtent`, then give each of the extent scan crawlers its own `BtrfsDataExtentTreeFetcher` instance. This enables extent tree searches to avoid pure (non-mixed) metadata block groups. `BeesCrawl` is now used only for its interface to `BeesRoots` for saving state in `beescrawl.dat`, and never to determine the next extent tree item. 
* Move subvol-specific parts of `BeesRoots` into a new class `BeesScanModeSubvol` so that `BtrfsScanModeExtent` doesn't have to enable or support them. In particular, `bees -m4` no longer enumerates all of the _subvol_ crawlers. `BeesRoots` is still used to save and load crawl state. * Move several members from `BtrfsScanModeExtent` into a per-crawler state object `SizeTier` to eliminate the need for some locks and to maintain separate cache state for `BtrfsDataExtentTreeFetcher`. * Reuse the `BtrfsTreeItem` to get the generation field for the transid range filter. * Avoid a few corner cases when handling errors, where extent scan might drop an extent without scanning it, or fail to advance to the next extent. * Enable the extent-skipping algorithm for large size tiers, now that `BeesCrawl::fetch_extents` is no longer slowing it down. * Add a debug stream interface which developers can easily turn on when needed to inspect the decisions that extent scan is making. * Track metrics that are more useful, particularly searches per extent scanned, and fraction of extents that are skipped. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2025-02-06 22:43:22 -05:00
Zygo Blaxell	ea45982293	throttle: add delays to match deferred request rate to btrfs completion rate Measure the time spent running various operations that extend btrfs transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe) and arrange for each operation to run for not less than the average amount of time by adding a sleep after each operation that takes less than the average. The delay after each operation is intended to slow down the rate of deferred and long-running requests from bees to match the rate at which btrfs is actually completing them. This may help avoid big spikes in latency if btrfs has so many requests queued that it has to force a commit to release memory. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-16 23:32:18 -05:00
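The throttling idea in commit `ea45982293` can be illustrated with a running average. This is a minimal sketch under stated assumptions, not the bees implementation: `ThrottleSketch` is a hypothetical class, the average is a plain mean over all samples seen so far (the commit does not say how the average is computed), and the current sample is included in the mean.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical class: tracks the mean duration of one kind of operation and
// reports how long to sleep after a run that was faster than the mean.
class ThrottleSketch {
public:
    using Duration = std::chrono::duration<double>;  // seconds

    // Record one completed operation and return the extra delay that would
    // stretch it to the running average (zero if it was already that slow).
    Duration record(Duration elapsed) {
        assert(elapsed.count() >= 0.0);
        m_total += elapsed;
        ++m_count;
        const Duration average = m_total / static_cast<double>(m_count);
        return elapsed < average ? average - elapsed : Duration::zero();
    }

private:
    Duration m_total{0.0};
    unsigned long m_count = 0;
};
```

A caller would time each operation, pass the elapsed time to `record`, and sleep for the returned duration (for example with `std::this_thread::sleep_for`) before starting the next request.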
Zygo Blaxell	f209cafcd8	bees: bump the file limits again, 512k files and 64k dirs Test machines keep blowing past the 32k file limit. 16 worker threads at 10,000 files each is much larger than 32k. Other high-FD-count services like DNS servers ask for million-file rlimits. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-16 22:54:12 -05:00
Zygo Blaxell	08fe145988	context: wait for btrfs send to finish, then try dedupe again Dedupe is not possible on a subvol where a btrfs send is running: BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress) btrfs informs a process with EAGAIN that a dedupe could not be performed due to a running send operation. It would be possible to save the crawler state at the affected point, fork a new crawler that avoids the subvol under send, and resume the crawler state after a successful dedupe is detected; however, this only helps the intersection of the set of users who have unrelated subvols that don't share extents, and the set of users who cannot simply delay dedupe until send is finished. The simplest approach is to simply stop and wait until the send goes away. The simplest approach is taken here. When a dedupe fails with EAGAIN, affected Tasks will poll, approximately once per transaction, until the dedupe succeeds or fails with a different error. bees dedupe performance corresponds with the availability of subvols that can accept dedupe requests. While the dedupe is paused, no new Tasks can be performed by the worker thread. If subvols are small and isolated from the bulk of the filesystem data, the result will be a small but partial loss of dedupe performance during the send as some worker threads get stuck on the sending subvol. If subvols heavily share extents with duplicate data in other subvols, worker threads will all become blocked, and the entire bees process will pause until at least some of the running sends terminate. During the polling for btrfs send, the dedupe Task will hold its dst file open. This open FD won't interfere with snapshot or file delete because send subvols are always read-only (it is not possible to delete a file on a RO subvol, open or otherwise) and send itself holds the affected subvol open, preventing its deletion. 
Once the send terminates, the dedupe will terminate soon after, and the normal FD release can occur. This pausing during btrfs send is unrelated to the `--workaround-btrfs-send` option, although `--workaround-btrfs-send` will cause the pausing to trigger less often. It applies to all scan modes. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-14 14:51:28 -05:00
Zygo Blaxell	bb09b1ab0e	roots: drop method `transid_re` There are no callers of this method any more, and it exposes more of BeesRoots than we really want things to have access to. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-13 23:19:43 -05:00
Zygo Blaxell	7a197e2f33	bees: post-kernel-5.7 toxic extent handling Toxic extents are mostly gone in kernel 5.7 and later. Increase the timeout for toxic extent handling to reduce false positives, and remove persistently stored toxic hashes from the hash table. Toxic hashes are still stored nonpersistently to help mitigate problems due to any remaining kernel bugs. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:52 -05:00
Zygo Blaxell	9c183c2c22	progress: put the progress table in the stats and status files Make the progress information more accessible, without having to enable full debug log and fish it out of the stream with grep. Also increase the progress log level to INFO. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	9987aa8583	progress: estimate actual data sizes for progress report Replace pointers in the "done" and "total" columns with estimated data sizes for each size tier. The estimation is based on statistics collected from extents scanned during the current bees run. Move the total size for the entire filesystem up to the heading. Report the _completed_ position (i.e. the one that would be saved in `beescrawl.dat`), not the _queued_ position (i.e. the one where the next Task would be created in memory). At the end of the data, the crawl pointer ends up at some random point in the filesystem just after the newest extent, so the progress gets to 99.7% and then goes to some random value like 47% or 3%, not to 100%. Report "deferred" in the "done" column when the crawler is waiting for the next transid, and "finished" in the "%done" column when the crawler has reached the end of the data. Suppress the ETA when finished. This makes it clear that there's no further work to do for these crawlers. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	8080abac97	extent scan: refactor BeesScanMode so derived classes decide their own scan scheduling BeesScanModeExtent uses six scan Tasks instead of one, which leads to awkwardness like the do_scan method to tell crawl_roots how to do what it shouldn't need to know how to do anyway. Move the crawl_roots logic into the ::scan methods themselves. This also deletes the very popular "crawl_more ran out of data" message. Extent scan explicitly indicates when a scan is complete, so there's no longer a need to fish this message out of the log. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	6542917ffa	extent scan: introduce SCAN_MODE_EXTENT The EXTENT scan mode reads the extent tree, splits it into tiers by extent size, converts each tier's extents into subvol/inode/offset refs, then runs the legacy bees dedupe engine on the refs. The extent scan mode can cheaply compute completion percentage and ETA, so do that every time a new transid is observed. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-12-01 00:17:51 -05:00
Zygo Blaxell	24b08ef7b7	scan_one_extent: eliminate nuisance dedupes, drop caches after reading data A laundry list of problems fixed: * Track which physical blocks have been read recently without making any changes, and don't read them again. * Separate dedupe, split, and hole-punching operations into distinct planning and execution phases. * Keep the longest dedupe from overlapping dedupe matches, and flatten them into non-overlapping operations. * Don't scan extents that have blocks already in the hash table. We can't (yet) touch such an extent without making unreachable space. Let them go. * Give better information in the scan summary visualization: show dedupe range start and end points (<ddd>), matching blocks (=), copy blocks (+), zero blocks (0), inserted blocks (.), unresolved match blocks (M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i), and there's-a-bug-we-didn't-do-this-one blocks (#). * Drop cached data from extents that have been inserted into the hash table without modification. * Rewrite the hole punching for uncompressed extents, which apparently hasn't worked properly since the beginning. Nuisance dedupe elimination: * Don't do more than 100 dedupe, copy, or hole-punch operations per extent ref. * Don't split an extent or punch a hole unless dedupe would save at least half of the extent ref's size. * Write a "skip:" summary showing the planned work when nuisance dedupe elimination decides to skip an extent. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
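The two nuisance-dedupe limits listed in commit `24b08ef7b7` can be read as a filter over a planned set of operations. The following is a rough sketch of that reading only: `ExtentPlanSketch` and `is_nuisance` are hypothetical names, and how the real planner combines the two limits is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical summary of the work planned for one extent ref.
struct ExtentPlanSketch {
    std::size_t operations;          // planned dedupe + copy + hole-punch operations
    bool splits_or_punches;          // plan needs an extent split or a hole punch
    std::uint64_t bytes_saved;       // bytes that dedupe would free
    std::uint64_t extent_ref_bytes;  // size of the extent ref
};

constexpr std::size_t kMaxOperationsPerRef = 100;

// Returns true when the plan should be skipped under the two stated limits.
inline bool is_nuisance(const ExtentPlanSketch &plan) {
    assert(plan.bytes_saved <= plan.extent_ref_bytes);
    // Limit 1: no more than 100 operations per extent ref.
    if (plan.operations > kMaxOperationsPerRef) return true;
    // Limit 2: splitting or hole punching must save at least half the ref.
    if (plan.splits_or_punches && plan.bytes_saved * 2 < plan.extent_ref_bytes) return true;
    return false;
}
```

For example, a plan that splits a 128 KiB extent ref to save a single 4 KiB block is rejected, while one that saves 64 KiB of the same ref is accepted.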
Zygo Blaxell	97eab9655c	types: add shrink_begin and shrink_end methods for BeesFileRange and BeesRangePair These allow trimming of overlapping dedupes. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
Zygo Blaxell	33cde5de97	bees: increase file cache size limits With some extents having 9999 refs, we can use much larger caches for file descriptors. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
Zygo Blaxell	8bac00433d	bees: reduce extent ref limit to 9999 Originally the limit was 2730 (64KiB worth of ref pointers). This limit was a little too low for some common workloads, so it was then raised by a factor of 256 to 699050, but there are a lot of problems with extent counts that large. Most of those problems are memory usage and speed problems, but some of them trigger subtle kernel MM issues. 699050 references is too many to be practical. Set the limit to 9999, only 3-4x larger than the original 2730, to give up on deduplication when each deduped ref reduces the amount of space by no more than 0.01%. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
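The numbers in commit `8bac00433d` can be checked with a little arithmetic. The sketch below assumes that each extent reference occupies three 64-bit fields (24 bytes) in the result buffer and ignores any buffer header; both are assumptions made for illustration, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: one reference = three 64-bit values = 24 bytes.
constexpr std::size_t kBytesPerRef = 3 * sizeof(std::uint64_t);

// Number of whole references that fit in a buffer of the given size.
constexpr std::size_t refs_that_fit(std::size_t buffer_bytes) {
    return buffer_bytes / kBytesPerRef;
}

// 64 KiB of ref pointers gives the original limit.
static_assert(refs_that_fit(64 * 1024) == 2730, "original limit");
// A buffer 256 times larger (16 MiB) gives the raised limit.
static_assert(refs_that_fit(256 * 64 * 1024) == 699050, "raised limit");

// Share of an extent's references represented by one more ref at the limit,
// as a percentage.
inline double marginal_saving_percent(std::size_t ref_limit) {
    assert(ref_limit > 0);
    return 100.0 / static_cast<double>(ref_limit);
}
```

With a limit of 9999, `marginal_saving_percent` returns roughly 0.01, which matches the threshold named in the commit message.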
Zygo Blaxell	8d08a3c06f	readahead: inject some sanity at the foundation of an insane architecture This solves some of the worst problems with bees reads: 1. The kernel readahead doesn't work. More precisely, it's much better adapted for a very different use case: a single thread alternating between reading a file sequentially and processing the data that was read. bees has multiple threads which compete for access to IO and then issue reads in random order immediately after the call to readahead. The kernel uses idle ioprio scheduling for the readaheads, so the readaheads get preempted by the random reads, or cancels the readaheads because the data access pattern isn't sequential after the readahead was issued. 2. Seeking drives perform terribly with multiple competing readers, especially with btrfs striped profiles where the iops are broken into tiny stripe-sized pieces. At one point I intended to read the btrfs device map and figure out which devices can be read in parallel, but to make that useful, the user needs to have an array with multiple drives in single profile, or 4+ drives in raid1 profile. In all other cases, the elaborate calculations always return the same result: there can be only one reader at a time. This commit fixes both problems: 1. Don't use the kernel readahead. Use normal reads into a dummy buffer instead. 2. Allow only one thread to readahead at any time. Once the read is completed, the data is in the page cache, and all the random-order small reads that bees does will hit the page cache, not a spinning disk. In some cases we need to read two things close together, so add a `bees_readahead_pair` which holds one lock across both reads. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
Zygo Blaxell	a7baa565e4	crawl: rename next_transid() to avoid confusion with BeesScanMode::next_transid() Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:30:33 -05:00
Zygo Blaxell	b408eac98e	trace: add file and line numbers all the way up the stack These were added to crucible all the way back in 2018 (`1beb61fb78` "crucible: error: record location of exception in what() message") but it's even more useful in the stack tracer in bees. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2024-11-30 23:27:24 -05:00
Zygo Blaxell	3430f16998	context: create a Pool of BtrfsIoctlLogicalInoArgs objects Each object contains a 16 MiB buffer, which is very heavy for some malloc implementations. Keep the objects in a Pool so that their buffers are only allocated and deallocated once in the process lifetime. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2023-02-23 22:45:31 -05:00
Zygo Blaxell	d5a99c2f5e	roots: don't share a RootFetcher between threads If the send workaround is enabled, it is possible for two threads (a thread running the crawl_new task, and a thread attempting to apply the send workaround) to access the same RootFetcher object at the same time. That never ends well. Give each function its own BtrfsRootFetcher object. Fixes: https://github.com/Zygo/bees/issues/250 Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2023-02-20 11:14:34 -05:00
Zygo Blaxell	849c071146	hash: flush the table more slowly With SIGTERM and fast exit, the trickle writeback is less important. We don't want to flood people's IO subsystems with continuous writes. This really should be configurable at runtime. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2023-01-27 22:16:02 -05:00
Zygo Blaxell	cbc76a7457	hash: don't spin when writes fail When a hash table write fails, we skip over the write throttling because we didn't report that we successfully wrote an extent. This can be bad if the filesystem is full and the allocations for writes are burning a lot of CPU time searching for free space. We also don't retry the write later on since we assume the extent is clean after a write attempt whether it was successful or not, so the extent might not be written out later when writes are possible again. Check whether a hash extent is dirty, and always throttle after attempting the write. If a write fails, leave the extent dirty so we attempt to write it out the next time flush cycles through the hash table. During shutdown this will reattempt each failing write once, after that the updated hash table data will be dropped. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2023-01-23 00:09:26 -05:00
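The behavior described in commit `cbc76a7457` (throttle after every write attempt, and keep a failed extent dirty for the next pass) might look like the following. This is a minimal sketch with hypothetical names (`HashFlushSketch`, `flush_pass`); the write and throttle steps are passed in as callbacks so the control flow is visible without any real I/O.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct FlushResultSketch {
    std::size_t attempted = 0;
    std::size_t written = 0;
};

// Hypothetical model of a hash table split into extents with dirty flags.
class HashFlushSketch {
public:
    explicit HashFlushSketch(std::size_t extents) : m_dirty(extents, false) {}

    void mark_dirty(std::size_t i) { m_dirty.at(i) = true; }
    bool is_dirty(std::size_t i) const { return m_dirty.at(i); }

    // One pass over the table. write_extent returns false on failure.
    FlushResultSketch flush_pass(const std::function<bool(std::size_t)> &write_extent,
                                 const std::function<void()> &throttle) {
        assert(write_extent && throttle);
        FlushResultSketch result;
        for (std::size_t i = 0; i < m_dirty.size(); ++i) {
            if (!m_dirty[i]) continue;       // clean extents are skipped
            ++result.attempted;
            if (write_extent(i)) {
                m_dirty[i] = false;          // success: extent is clean
                ++result.written;
            }
            // On failure the extent stays dirty and is retried next pass.
            throttle();                      // always throttle after an attempt
        }
        return result;
    }

private:
    std::vector<bool> m_dirty;
};
```

Because `throttle` runs whether or not the write succeeded, a full filesystem cannot turn the flush loop into a busy loop of failing writes.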
Zygo Blaxell	c3b664fea5	context: don't forget to retry locked extents The caller of scan_forward has to stop advancing the BeesFileCrawl position when an extent lock blocks a scan, so that it will resume from the same position when the Task is scheduled again; otherwise, bees simply skips over the extent and leaves it incompletely deduped. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-22 23:46:36 -05:00
Zygo Blaxell	bbcfd9daa6	roots: replace BEES_TRANSID_FACTOR with BEES_TRANSID_POLL_INTERVAL Restart crawl_more (and update crawl roots and flush FD caches) every time the transid changes, and only when the transid changes, but not more often than a reasonable minimum poll interval. Clean up the log message: use the proper thread name and remove the wildly inaccurate estimate of when crawl will resume. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:01 -05:00
Zygo Blaxell	d6d3e1045e	context: keep the resolve cache smaller We don't need to cache 65536 extent maps, especially if each one can have almost 700K references. Valgrind's massif tool points to the extent map cache as a very large memory allocator, but test runs with memcg disagree. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:01 -05:00
Zygo Blaxell	d5d17cbe62	roots: run insert_new_crawl from within a Task If we have loadavg targeting enabled, there may be no worker threads available to respond to new subvols, so we should not bother updating the subvols list. Put insert_new_crawl into a Task so it only executes when a worker is available. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:01 -05:00
Zygo Blaxell	03f809bf22	roots: reimplement scan modes using virtual base and methods Split each scan mode into two distinct phases: 1. A heavy discovery phase, where we search the entire filesystem for something (new items in subvol trees in this case). 2. A light consuming phase, where we fetch extents to dedupe from places that we found in the discovery phase. Part 1 recomputes the subvol ordering every time there is a new transid. For some scan modes this computation is quite expensive, far too costly to pay for every extent, so we do it no more than once per transaction. Part 2 is run every time a worker thread hits the crawl_more Task. It simply pulls one extent from the first crawler off a sorted list, removing the crawler from the list when the crawler runs out of data. Part 1 creates a new structure and swaps it into place, while Part 2 continues to run using the previous structure. Neither of these needs to block the other, so they don't. The separate class and base pointer also make it easier to add new scan modes that are not based on subvol trees or that don't use BeesCrawl. While we're here, fix up some method visibility in BeesRoots. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:01 -05:00
Zygo Blaxell	0dca6f74b0	roots: remove duplicate default scan mode setting Set the constructor's default scan mode to an invalid mode, so if we change the default, we don't have to update two places. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:01 -05:00
Zygo Blaxell	f5c4714a28	roots: add 'recent' crawl mode for a mix of new and old data Crawl mode 3 'recent' prioritizes data from new updates to previously scanned subvols over subvols that have not been completely scanned yet. If no such new data exists, falls back to a variation of 'lockstep' scan mode. This enables us to keep up with new data as it arrives, a key weakness of all the other scan modes, and worth violating our unwritten "no new scan modes until we have extent-tree dedupe working" policy for. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:00 -05:00
Zygo Blaxell	84f91af503	context: don't let multiple worker Tasks get stuck on a single extent or inode When two Tasks attempt to lock the same extent, append the later Task to the earlier Task's post-exec work queue. This will guarantee that all Tasks which attempt to manipulate the same extent will execute sequentially, and free up threads to process other extents. Similarly, if two scanner threads operate on the same inode, any dedupe they perform will lock out other scanner threads in btrfs. Avoid this by serializing Task objects that reference the same file. This does theoretically use an unbounded amount of memory, but in practice a Task that encounters a contended extent or inode quickly stops spawning new Tasks that might increase the queue size, and all Tasks that might contend for the same lock(s) end up on a single FIFO queue. Note that the scope of inode locks is intentionally global, i.e. when an inode is locked, it locks every inode with the same number in every subvol. This avoids significant lock contention and task queue growth when the same inode with the same file extents appear in snapshots. Fixes: https://github.com/Zygo/bees/issues/158 Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:00 -05:00
Zygo Blaxell	31d26bcfc6	roots: organize scan workers by inode instead of extent Split crawlers into two separate Tasks: 1. a Task which locates the next inode with a new data extent. 2. a Task which scans every new extent in that inode. This simplifies some lock contention and execution ordering issues. Files are read sequentially. Workers dynamically scale up or down as needed, without creating thousands of deferred Task objects. Workers obtain inode locks for different inodes in btrfs, so they can work in parallel instead of waiting for each other. This change in behavior comes with new names for the worker Tasks: "crawl_master" is now "crawl_more", the singular Task which creates inode-scanning Tasks. "crawl_<subvol>" is now "crawl_<subvol>_<inode>". Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:00 -05:00
Zygo Blaxell	e13c62084b	roots: use scan mode 'independent' by default Independent subvol scanners fairly consistently outperform either of the correlated scan modes. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:00 -05:00
Zygo Blaxell	7cef1133be	roots: use symbolic names for SCAN_MODEs This was done on the development branch three years ago, and has been creating annoying merge conflicts ever since. Sync up the branches so they have the same names for these. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:51:00 -05:00
Zygo Blaxell	f98599407f	roots: rework btrfs send workaround using btrfs-tree Drop the cache since we no longer have to open a file every time we check a subvol's status. Also stop counting workaround events at the root level twice. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:59 -05:00
Zygo Blaxell	23c16aa978	BeesFileRange: coalesce is not used, subtract was never implemented Less dead code to maintain. Also more Doxygen comments. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:59 -05:00
Zygo Blaxell	9cdeb608f5	bees: drop the balance/logical workaround that has been disabled for two years Kernels that needed the balance workaround frankly are too buggy to run bees at all. The workaround also makes the locking stories around logical_ino calls and process exit complicated, so get rid of it completely. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:58 -05:00
Zygo Blaxell	31b2aa3c0d	context: speed up orderly process termination Quite often bees exceeds its service timeout for termination because it is waiting for a loop embedded in a Task to finish some long-running btrfs operation. This can cause bees to be aborted by SIGKILL before it can completely flush the hash table or save crawl state. There are only two important things SIGTERM does when bees terminates: 1. Save crawl progress 2. Flush out the hash table Everything else is automatically handled by the kernel when the process is terminated by SIGKILL, so we don't have to bother doing it ourselves. This can save considerable time at shutdown since we don't have to wait for every thread to reach a point where it becomes idle, or force loops to terminate by throwing exceptions, or check a condition every time we access a pointer. Instead, we need do only the things in the list above, and then call _exit() to clean up everything else. Hash table and crawl state writeback can happen in their background threads instead of the foreground one. Separate the "stop" method for these classes into "stop_request" and "stop_wait" so that these writebacks can run at the same time. Deprecate and remove all references to the BeesHalt exception, and remove several unnecessary checks for BeesContext::stop_requested. Pause the task queue instead of cancelling it, which preserves the crawl progress state and stops new Tasks from competing for iops and CPU during writeback. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:58 -05:00
Zygo Blaxell	a2e1887c52	bees: use MultiLocker to serialize dedupe and logical_ino In current kernels there is a bug which leads to an infinite loop in add_all_parents(). The bug is triggered by one thread running dedupe while another runs logical_ino. Work around this by ensuring that bees process never runs dedupe and logical_ino ioctls at the same time. Any number of either can run at the same time, but not one of both. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:55 -05:00
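The locking rule in commit `a2e1887c52` (any number of holders of one type, never holders of two types at once) can be sketched with a mutex and a condition variable. This is an illustration of the rule, not the crucible `MultiLocker` class: `MultiLockSketch` and its methods are hypothetical, and this simple version makes no attempt at fairness between the two types.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical lock: many holders of the same type may coexist, but
// holders of different types exclude each other.
class MultiLockSketch {
public:
    // Non-blocking attempt; returns false if another type currently holds it.
    bool try_acquire(const std::string &type) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_holders != 0 && m_type != type) return false;
        m_type = type;
        ++m_holders;
        return true;
    }

    // Blocking variant: waits until no other type holds the lock.
    void acquire(const std::string &type) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [&] { return m_holders == 0 || m_type == type; });
        m_type = type;
        ++m_holders;
    }

    void release() {
        std::lock_guard<std::mutex> lock(m_mutex);
        assert(m_holders > 0);
        if (--m_holders == 0) m_cv.notify_all();  // wake waiters of other types
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::string m_type;
    std::size_t m_holders = 0;
};
```

A thread about to issue a dedupe ioctl would call `acquire("dedupe")`, and a thread about to issue a logical_ino ioctl would call `acquire("logical_ino")`; each calls `release()` afterwards. The type names here are arbitrary strings chosen for the example.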
Zygo Blaxell	cc87125e41	bees: drop bees_sync, we will not need it bees_sync() was an exception-trapping wrapper around fsync() which is not needed in any of the contexts from which it was called: 1. dedupe operations implicitly flush the src data, so there is no need to call fsync() to do that twice. 2. crawl position is written to a temporary file and renamed over the original, which always forces a flush when the original exists. On the first write, where there is no original, a crash would result in starting over with an empty or hole-filled beescrawl file, which is the initial state of bees. There is also a long history of kernel bugs triggered by fsync() in this case. 3. we use unreadahead to trigger writeback for flushing the hash table to persistent storage. Here is a space where we might use fsync after all, as part of bees_unreadahead's emulation of POSIX_FADV_DONTNEED, but we need to get read-once behavior from the scanner before we can use this capability. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:54 -05:00
Zygo Blaxell	be9321cdb3	roots: correctly track crawl dirty state If there's an error while writing the crawl state, the state should remain dirty. If the crawl state is successfully written, the state is only clean if there were no changes to crawl state since the write was committed. We need to release the lock while writing the state but correctly set the dirty flag when the state is written successfully. Replace the bool with a version number counter. Track the last version successfully saved and the current version of the crawl state. The state is dirty if these counters disagree and clean if they agree. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:54 -05:00
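The version-counter scheme described here can be sketched as follows. `CrawlStateTracker` and its members are hypothetical names, and `write_fn` stands in for whatever actually writes the state. The sketch shows only the counter logic: the lock is released during the write, and the state is clean afterwards only if nothing changed in the meantime.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical tracker: dirty means "current version differs from last saved version".
class CrawlStateTracker {
    std::mutex m_mutex;
    std::uint64_t m_version = 0;         // bumped on every change to the state
    std::uint64_t m_saved_version = 0;   // last version successfully written

public:
    void mark_changed() {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_version;
    }

    bool dirty() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_version != m_saved_version;
    }

    // write_fn performs the slow I/O with the lock released and returns true on success.
    bool save(const std::function<bool()> &write_fn) {
        std::uint64_t snapshot;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            snapshot = m_version;        // the version being written
        }
        if (!write_fn()) return false;   // write failed: state stays dirty
        std::lock_guard<std::mutex> lock(m_mutex);
        m_saved_version = snapshot;      // clean only if no change happened during the write
        return true;
    }
};
```

A plain bool cannot express the last case: clearing it after a successful write would lose any change made while the lock was released.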
Zygo Blaxell	a9c81e5531	bees: drop m_parent_ctx It has not been used since 2016. Also drop the explicit default constructor. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-12-20 20:50:54 -05:00
Zygo Blaxell	3654738f56	bees: fix deprecated-copy warnings for clang-14 Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2022-10-23 22:39:59 -04:00
Zygo Blaxell	fbf6b395c8	types: member m_fd in BeesFileRange must be protected against data races We had an unfortunate pattern of: const BeesFileRange bfr; shared_ptr<BeesContext> ctx; // ... BEESNOTE("foo " << bfr); bfr.fd(ctx); BEESNOTE("foo after opening: " << bfr); If dump_status started running after the first BEESNOTE, but before the second, then bfr.fd() might expose a single Fd object's shared_ptr member to two threads at the same time (the thread running dump_status and the thread running BEESNOTE) without protection by a lock. One of the threads would see a partially-initialized Fd object, and the other thread would crash on an assertion failure, e.g. #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007f4c4fde5537 in __GI_abort () at abort.c:79 #2 0x00007f4c4fde540f in __assert_fail_base (fmt=0x7f4c4ff4e128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5557605629dd "!m_destroyed", file=0x5557605627c0 "../include/crucible/namedptr.h", line=77, function=<optimized out>) at assert.c:92 #3 0x00007f4c4fdf4662 in __GI___assert_fail (assertion=assertion@entry=0x5557605629dd "!m_destroyed", file=file@entry=0x5557605627c0 "../include/crucible/namedptr.h", line=line@entry=77, function=function@entry=0x555760562970 "crucible::NamedPtr<Return, Arguments>::Value::~Value() [with Return = crucible::IOHandle; Arguments = {int}]") at assert.c:101 #4 0x00005557605306f6 in crucible::NamedPtr<crucible::IOHandle, int>::Value::~Value (this=0x7f4a3c2ff0d0, __in_chrg=<optimized out>) at ../include/crucible/namedptr.h:77 #5 0x00005557605137da in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151 #6 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151 #7 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f4c4c5b5f28, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733 #8 std::__shared_ptr<crucible::IOHandle, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183 #9 std::shared_ptr<crucible::IOHandle>::~shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121 #10 crucible::Fd::~Fd (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at ../include/crucible/fd.h:46 #11 BeesFileRange::file_size (this=0x7f4c4e5ba4a0) at bees-types.cc:156 #12 0x0000555760513950 in operator<< (os=..., bfr=...) at bees-types.cc:80 #13 0x000055576050d662 in std::function<void (std::ostream&)>::operator()(std::ostream&) const (__args#0=..., this=0x7f4c4e5b9f60) at /usr/include/c++/10/bits/std_function.h:622 #14 BeesNote::get_status[abi:cxx11]() () at bees-trace.cc:165 #15 0x00005557604c9676 in BeesContext::dump_status (this=0x5557611c4de0) at bees-context.cc:89 #16 0x00005557605206fb in std::function<void ()>::operator()() const (this=this@entry=0x7f4c4c5b65f0) at /usr/include/c++/10/bits/std_function.h:622 #17 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) (f=..., explainer=...) at error.cc:55 #18 0x000055576050aaa7 in operator() (__closure=0x5557611c52c8) at bees-thread.cc:22 #19 0x00007f4c501beed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #20 0x00007f4c502c8ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477 #21 0x00007f4c4febddef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Fix by making BeesFileRange::m_fd really const (not just mutable), then fix all the broken code referencing it. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2021-12-19 15:10:02 -05:00
Zygo Blaxell	01734e6d4b	hash: initialize m_dirty in BeesHashTable It turns out we never set m_dirty's initial value. This is not a practical problem because 1) it's mostly harmless if m_dirty is spuriously true, 2) we set it to true every time bees scans a data block, and 3) the allocation happens early in startup when most memory allocations are using zero-filled pages, so it's probably getting a false value at construction in most cases. valgrind complains about it, so it has to go. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2021-12-19 15:10:02 -05:00
Zygo Blaxell	a83c68eb18	bees: style cleanups: const, size_t, symbolic names No functional changes. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2021-12-19 15:10:02 -05:00
Zygo Blaxell	6d6686eb5b	context: get rid of resolve (LOGICAL_INO) serializer There are kernel bugs in LOGICAL_INO from time to time; however, we can't avoid these bugs by serializing LOGICAL_INO calls. It hasn't been used for some time, so remove the code and less-than-completely-accurate comments. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2021-12-19 15:10:02 -05:00
Zygo Blaxell	85c93c10e6	bees: clean up #include list No need for atomic, and sort the Linux headers. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2021-11-29 21:27:48 -05:00


119 Commits