mirror of https://github.com/Zygo/bees.git synced 2025-08-01 13:23:28 +02:00

817 Commits

Author SHA1 Message Date
Zygo Blaxell
3a17a4dcdd tempfile: make sure FS_COMPR_FL stays set
btrfs will set the FS_NOCOMP_FL flag when all of the following are true:

1.  The filesystem is not mounted with the `compress-force` option
2.  Heuristic analysis of the data suggests the data is compressible
3.  Compression fails to produce a result that is smaller than the original

If the compression ratio is 40%, and the original data is 128K long,
then compressed data will be about 52K long (rounded up to 4K), so item
3 is usually false; however, if the original data is 8K long, then the
compressed data will be 8K long too, and btrfs will set FS_NOCOMP_FL.

To work around that, keep setting FS_COMPR_FL and clearing FS_NOCOMP_FL
every time a TempFile is reset.
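
A minimal sketch of the flag handling described above, assuming the standard Linux `FS_IOC_GETFLAGS` / `FS_IOC_SETFLAGS` ioctls from `<linux/fs.h>`. The helper names `tempfile_flags` and `reset_tempfile_flags` are hypothetical and are not taken from the bees source; the sketch also clears FS_NOCOW_FL, on the assumption that the earlier tempfile commit in this log is already applied.

```cpp
#include <linux/fs.h>    // FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_*_FL
#include <sys/ioctl.h>

// Hypothetical helper: compute the inode flags a dedupe temporary file
// should carry, given the flags it currently has.
inline int tempfile_flags(int flags) {
    flags |= FS_COMPR_FL;                     // keep requesting compression
    flags &= ~(FS_NOCOMP_FL | FS_NOCOW_FL);   // undo "don't compress" and nodatacow
    return flags;
}

// Hypothetical helper: apply the flags to an open file descriptor.
// Returns true on success, false if either ioctl fails.
inline bool reset_tempfile_flags(int fd) {
    int flags = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) != 0) {
        return false;
    }
    const int wanted = tempfile_flags(flags);
    if (wanted == flags) {
        return true;                          // nothing to change
    }
    return ioctl(fd, FS_IOC_SETFLAGS, &wanted) == 0;
}
```

Calling something like `reset_tempfile_flags(fd)` on every reset, rather than once at creation, is what undoes a FS_NOCOMP_FL that btrfs set after an earlier write.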

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-29 23:25:36 -04:00
Zygo Blaxell
4039ef229e tempfile: clear FS_NOCOW_FL while setting FS_COMPR_FL
FS_NOCOW_FL can be inherited from the subvol root directory, and it
conflicts with FS_COMPR_FL.

We can only dedupe when FS_NOCOW_FL is the same on src and dst, which
means we can only dedupe when FS_NOCOW_FL is clear, so we should clear
FS_NOCOW_FL on the temporary files we create for dedupe.

Fixes: https://github.com/Zygo/bees/issues/314
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-29 23:24:55 -04:00
Zygo Blaxell
e9d4aa4586 roots: make the "idle" label useful
Apply the "idle" label only when the crawl is finished _and_ its
transid_max is up to date.  This makes the keyword "idle" better reflect
when bees has finished not only crawling, but also scanning the crawled
extents in the queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 23:06:14 -04:00
Zygo Blaxell
504f4cda80 progress: move the "idle" cell to the next cycle ETA column
When all extents within a size tier have been queued, and all the
extents belong to the same file, the queue might take a long time to
fully process.  Also, any progress that is made will be obscured by
the "idle" tag in the "point" column.

Move "idle" to the next cycle ETA column, since the ETA duration will
be zero, and no useful information is lost since we would have "-"
there anyway.

Since the "point" column can now display the maximum value, lower
that maximum to 999999 so that we don't use an extra column.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 22:33:05 -04:00
Zygo Blaxell
6c36f4973f extent scan: log the bfr when removing a prealloc extent
With subvol scan, the crawl task name is the subvol/inode pair
corresponding to the file offset in the log message.  The identity of
the file can be determined by looking up the subvol/inode pair in the
log message.

With extent scan, the crawl task name is the extent bytenr corresponding
to the file offset in the log message.  This extent is deleted when the
log message is emitted, so a later lookup on the extent bytenr will not
find any references to the extent, and the identity of the file cannot
be determined.

Log the bfr, which does a /proc lookup on the name of the fd, so the
filename is logged.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 22:33:05 -04:00
Zygo Blaxell
b1bd99c077 seeker: harden against changes in the data during binary search
During the search, the region between `upper_bound` and `target_pos`
should contain no data items.  The search lowers `upper_bound` and raises
`lower_bound` until they both point to the last item before `target_pos`.

The `lower_bound` is increased to the position of the last item returned
by a search (`high_pos`) when that item is lower than `target_pos`.
This avoids some loop iterations compared to a strict binary search
algorithm, which would increase `lower_bound` only as far as `probe_pos`.

When the search runs over live extent items, occasionally a new extent
will appear between `upper_bound` and `target_pos`.  When this happens,
`lower_bound` is bumped up to the position of one of the new items, but
that position is in the "unoccupied" space between `upper_bound` and
`target_pos`, where no items are supposed to exist, so `seek_backward`
throws an exception.

To cut down on the noise, only increase `lower_bound` as far as
`upper_bound`.  This avoids the exception without increasing the number
of loop iterations for normal cases.

In the exceptional cases, extra loop iterations are needed to skip over
the new items.  This raises the worst-case number of loop iterations
by one.
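
The clamp described above can be sketched as a single step.  This is an illustration only: `raise_lower_bound` is a hypothetical name, and the real `seek_backward` in `include/crucible/seeker.h` does more than this.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: the new lower_bound after a probe returned items
// below target_pos, the highest of which is at high_pos.
// Jumping straight to high_pos saves iterations, but the result is never
// allowed to exceed upper_bound, so the invariant
// lower_bound <= upper_bound holds even if new items appeared mid-search.
inline uint64_t raise_lower_bound(uint64_t high_pos, uint64_t upper_bound) {
    return std::min(high_pos, upper_bound);
}
```

With the positions from the trace in the following log entry, a normal probe gives `raise_lower_bound(6821963539, 6821963646) == 6821963539`, while the probe that hit the newly added extent gives `raise_lower_bound(6821963841, 6821963646) == 6821963646` instead of violating the constraint check.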

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
d5e805ab8d seeker: add a real-world test case
This seek_backward failed in bees because an extent appeared during
the search:

	fetch(probe_pos = 6821971036, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821971004 = probe_pos - have_delta 32 (want_delta 32)
	fetch(probe_pos = 6821971004, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970972 = probe_pos - have_delta 32 (want_delta 32)
	fetch(probe_pos = 6821970972, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970908 = probe_pos - have_delta 64 (want_delta 64)
	fetch(probe_pos = 6821970908, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970780 = probe_pos - have_delta 128 (want_delta 128)
	fetch(probe_pos = 6821970780, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970524 = probe_pos - have_delta 256 (want_delta 256)
	fetch(probe_pos = 6821970524, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821970012 = probe_pos - have_delta 512 (want_delta 512)
	fetch(probe_pos = 6821970012, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821968988 = probe_pos - have_delta 1024 (want_delta 1024)
	fetch(probe_pos = 6821968988, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821966940 = probe_pos - have_delta 2048 (want_delta 2048)
	fetch(probe_pos = 6821966940, target_pos = 6821971036)
	 = 6822575316..6822575316
	probe_pos 6821962844 = probe_pos - have_delta 4096 (want_delta 4096)
	fetch(probe_pos = 6821962844, target_pos = 6821971036)
	 = 6821962845..6821962848
	found_low = true, lower_bound = 6821962845
	lower_bound = high_pos 6821962848
	loop: lower_bound 6821962848, probe_pos 6821966942, upper_bound 6821971036
	fetch(probe_pos = 6821966942, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821966942
	loop: lower_bound 6821962848, probe_pos 6821964895, upper_bound 6821966942
	fetch(probe_pos = 6821964895, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821964895
	loop: lower_bound 6821962848, probe_pos 6821963871, upper_bound 6821964895
	fetch(probe_pos = 6821963871, target_pos = 6821971036)
	 = 6822575316..6822575316
	upper_bound = probe_pos 6821963871
	loop: lower_bound 6821962848, probe_pos 6821963359, upper_bound 6821963871
	fetch(probe_pos = 6821963359, target_pos = 6821971036)
	 = 6821963411..6821963422
	lower_bound = high_pos 6821963422
	loop: lower_bound 6821963422, probe_pos 6821963646, upper_bound 6821963871
	fetch(probe_pos = 6821963646, target_pos = 6821971036)
	 = 6822575316..6822575316

Here, we found nothing between 6821963646 and 6822575316, so upper_bound is reduced
to 6821963646...

	upper_bound = probe_pos 6821963646
	loop: lower_bound 6821963422, probe_pos 6821963534, upper_bound 6821963646
	fetch(probe_pos = 6821963534, target_pos = 6821971036)
	 = 6821963536..6821963539
	lower_bound = high_pos 6821963539
	loop: lower_bound 6821963539, probe_pos 6821963592, upper_bound 6821963646
	fetch(probe_pos = 6821963592, target_pos = 6821971036)
	 = 6821963835..6821963841

...but here, we found 6821963835 and 6821963841, which are between
6821963646 and 6822575316.  They were not there before, so the binary
search result is now invalid because new extent items were added while
it was running.  This results in an exception:

	lower_bound = high_pos 6821963841
	--- BEGIN TRACE --- exception ---
	objectid = 27942759813120, adjusted to 27942793363456 at bees-roots.cc:1103
	Crawling extent BeesCrawlState 250:0 offset 0x0 transid 1311734..1311735 at bees-roots.cc:991
	get_state_end at bees-roots.cc:988
	find_next_extent 250 at bees-roots.cc:929
	---  END  TRACE --- exception ---
	*** EXCEPTION ***
	exception type std::out_of_range: lower_bound = 6821963841, upper_bound = 6821963646 failed constraint check (lower_bound <= upper_bound) at ../include/crucible/seeker.h:139

The exception prevents seek_backward from returning a value, which keeps
a nonsense result from reaching any consumer of that value.

Copy the details of this search into a test case.  Note that the test
case won't reproduce the exception because the simulation of fetch()
is not changing the results part way through.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
337bbffac1 extent scan: drop a nonsense trace message
This message appears only during exception backtraces, but it doesn't
carry any useful information.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
527396e5cb extent scan: integrate seeker debug output stream
Send both tree_search ioctl and `seek_backward` debug logs to the
same output stream, but only write that stream to the debug log if
there is an exception.

The feature remains disabled at compile time.
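
A minimal sketch of the "buffer the debug output, write it only on an exception" pattern, assuming a plain `std::ostringstream` as the shared stream.  `run_with_debug_buffer` is a hypothetical name, not a function in bees or crucible.

```cpp
#include <functional>
#include <ostream>
#include <sstream>

// Hypothetical helper: run `work` with a private debug stream.  The stream
// contents reach `log` only if `work` throws; otherwise they are discarded.
inline void run_with_debug_buffer(
    const std::function<void(std::ostream &)> &work,
    std::ostream &log)
{
    std::ostringstream debug;      // collects both kinds of debug messages
    try {
        work(debug);
    } catch (...) {
        log << debug.str();        // something went wrong: keep the details
        throw;                     // let the caller handle the exception
    }
}
```

The cost of formatting the messages is still paid on every call, which is presumably why the feature stays disabled at compile time.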

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
bc7c35aa2d extent scan: only write a detailed debug log when there's an exception
Note that when enabled, the logs are still very CPU-intensive,
but most of the logs will be discarded.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
0953160584 trace: export exception_check
We need to call this from more than one place in bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
80f9c147f7 btrfs-tree: clean up the fetch function's return set
Commit d32f31f411 ("btrfs-tree: harden
`rlower_bound` against exceptional objects") passes the first btrfs item
in the result set that is above upper_bound up to `seek_backward`.
This is somewhat wasteful as `seek_backward` cannot use such a result.

Reverse that change in behavior, while keeping the rest of the other
commit.

This introduces a new case, where the search ioctl is producing items
that are above upper bound, but there are no items in the result set,
which continues looping until the end of the filesystem is reached.
Handle that by setting an explicit exit variable.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
50e012ad6d seeker: add a runtime debug stream
This allows detailed but selective debugging when using the library,
particularly when something goes wrong.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
9a9644659c trace: clean up the formatting around top-level exception log messages
Fewer newlines.  More consistent application of the "TRACE:" prefix.
All at the same log level.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
fd53bff959 extent scan: drop out-of-date comment
The comment describes an earlier version which submitted each extent
ref as a separate Task, but now all extent refs are handled by the same
Task to minimize the amount of time between processing the first and
last reference to an extent.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
9439dad93a extent scan: extra check to make sure no Tasks are started when throttled
Previously `scan()` would run the extent scan loop once, and enqueue one
extent, before checking for throttling.  Do an extra check before that,
and bail out so that zero extents are enqueued when throttled.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
ef9b4b3a50 extent scan: shorten task name for extent map
Linux kernel thread names are hardcoded at 16 characters.  Every character
counts, and "0x" wastes two.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
7ca857dff0 docs: add the ghost subvols bug to the bugs list
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Zygo Blaxell
8331f70db7 progress: fix ETA calculations
The "tm_left" field was the estimated _total_ duration of the crawl,
not the amount of time remaining.  The ETA timestamp was then calculated
based on the estimated time to run the crawl if it started _now_, not
at the start timestamp.

Fix the duration and ETA calculations.
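
The corrected arithmetic can be sketched as follows, assuming a simple linear extrapolation from the fraction of the crawl completed so far.  The struct, the function name, and the extrapolation itself are assumptions for illustration; the actual bees progress code may estimate the total differently.

```cpp
// Hypothetical result type: all values are in seconds.
struct EtaEstimate {
    double total;      // estimated duration of the whole crawl
    double remaining;  // estimated time left from `now`
    double finish;     // estimated finish timestamp
};

// Hypothetical helper: `start` and `now` are timestamps in seconds,
// `fraction_done` is in the range (0, 1].
inline EtaEstimate estimate_eta(double start, double now, double fraction_done) {
    const double elapsed = now - start;
    const double total = elapsed / fraction_done;
    const double remaining = total - elapsed;      // not `total`
    return EtaEstimate{ total, remaining, now + remaining };
}
```

For a crawl started at t=1000 that is 25% done at t=1100, this gives a total of 400 seconds, 300 seconds remaining, and a finish time of 1400.  Adding the total duration to the current time, as described in the bug above, would have reported 1500.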

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-06-18 21:17:48 -04:00
Steven Allen
a844024395 Make the runtime directory private
The status file contains sensitive information like filenames and duplicate chunk ranges. It might also make sense to set the process-wide `UMask=`, but that may have other unintended side effects.
2025-03-26 15:02:42 +00:00
Zygo Blaxell
47243aef14 hash: handle $BEESHOME on btrfs too
The `_nothrow` variants of `do_ioctl` return true when they succeed,
which is the opposite of what `ioctl` does.

Fix the logic so bees can correctly identify its own hash table when
it's on the same filesystem as the target.
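
To illustrate the inverted convention, here is a sketch of a success-returning wrapper around `ioctl`.  `ioctl_succeeded` is a hypothetical stand-in and is not the crucible `do_ioctl_nothrow` signature.

```cpp
#include <linux/fs.h>    // FS_IOC_GETFLAGS, used in the usage note
#include <sys/ioctl.h>

// Hypothetical wrapper: true means the ioctl worked.
// Raw ioctl() returns 0 on success and -1 on failure, so a condition
// written for one convention has to be negated when switching to the other.
template <class T>
bool ioctl_succeeded(int fd, unsigned long request, T *arg) {
    return ioctl(fd, request, arg) == 0;
}
```

With this wrapper, `if (ioctl_succeeded(fd, FS_IOC_GETFLAGS, &flags))` is the success branch, whereas `if (ioctl(fd, FS_IOC_GETFLAGS, &flags))` is the failure branch.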

Fixes: f6908420ad ("hash: handle $BEESHOME on non-btrfs")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-17 21:18:08 -05:00
Zygo Blaxell
a670aa5a71 extent scan: don't divide by zero if there were no loops
Commit 183b6a5361 ("extent scan: refactor
BeesCrawl, BeesScanMode*") moved some statistics calculations out of
the loop in `find_next_extent`, but did not ensure that the statistics
would not be calculated if the loop had not executed any iterations.

In rare instances, the function returns without entering the loop at all,
which results in divide by zero.  Add a check just before doing that.
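
A sketch of the kind of guard this describes, with hypothetical names (the actual statistics in `find_next_extent` are not shown in this log):

```cpp
#include <cstdint>

// Hypothetical helper: average of `total` over `loops` iterations.
// Returns 0 when the loop body never ran, instead of dividing by zero.
inline uint64_t average_per_loop(uint64_t total, uint64_t loops) {
    if (loops == 0) {
        return 0;
    }
    return total / loops;
}
```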

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
51b3bcdbe4 trace: deprecate BEESLOGTRACE, align trace logs with exception notices
Exceptions were logged at level NOTICE while the stack traces were logged
at level DEBUG.  That produced useless noise in the output with `-v5`
or `-v6`, where there were exception headings logged, but no details.

Fix that by placing the exceptions and traces at level DEBUG, but prefix
them with `TRACE:` for easy grepping.

Most of the events associated with BEESLOGTRACE either never happen,
or they are harmless (e.g. trying to open deleted files or subvols).
Reassign them to ordinary BEESLOGDEBUG, with one exception for
unrecognized Extent flags that should be debugged if any appear.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
ae58401d53 trace: avoid one copy in every trace function
While investigating https://github.com/Zygo/bees/issues/282 I noticed that
we're doing at least one unnecessary extra copy of the functor in BEESTRACE.
Get rid of it with a const reference.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-13 23:59:42 -05:00
Zygo Blaxell
3e7eb43b51 BeesStringFile: figure out when to call--or _not_ call--fsync
Older kernel versions featured some bugs in btrfs `fsync`, which could
leave behind "ghost dirents", orphan filename items that did not have
a corresponding inode.  These dirents were created during log replay
during the first mount after a crash due to several different bugs in
the log tree and its use over the years.  The last known bug of this
kind was fixed in kernel 5.16.  As of this writing, no fixes for this
bug have been backported to any earlier LTS kernel.

Some filesystems, including btrfs, will flush the contents of a new
file before renaming it over an old file.  On paper, btrfs can do this
very cheaply since the contents of the new file are not referenced, and
the old file not dereferenced, until a tree commit which includes both
actions atomically; however, in real life, btrfs provides `fsync`-like
semantics and uses the log-tree infrastructure to implement them, which
compromises performance and acts as a magnet for bugs.

The benefit of this trade-off is that `rename` can be used as a
synchronization point for data outside of the btrfs, which would not
happen if everything `rename` does was simply deferred to the next
tree commit.  The cost of this trade-off is that for the first 8 years
of its existence, bees would trigger the bug so often that the project
recommended its users put $BEESHOME in its own subvol to make it easy
to remove ghost dirents left behind by the bug.

Some other filesystems, such as xfs, don't have any special semantics for
`rename`, and require `fsync` to avoid garbage or missing data after
a crash.  Even filesystems which do have a special case for `rename`
can be configured to turn it off.

btrfs will silently delete data from files in the event that an
unrecoverable data block write error occurs.  Kernel version 6.2 adds
important new and unexpected cases where this can happen on filesystems
using raid56 data, but it also happens in all usable btrfs versions
(the silent deletion behavior was introduced in kernel version 3.9).

Unrecoverable write errors are currently reported to userspace only
through `fsync`.  Since the failed extents are deleted, they cannot be
detected via csum failures or scrub after the fact--and it's too late
by then, the data is already gone.  `fsync` is the last opportunity
to detect the write failure before the `rename`.  If the error is not
detected, the contents of the file will be silently discarded in btrfs.
The impact on bees is that scans will abruptly restart from zero after
a crash combined with some other reasonably common failures.

Putting all of this together leads to a rather complex workaround:
if the filesystem under $BEESHOME (specifically, the filesystem where
BeesStringFile objects such as `beescrawl.dat` are written) is a btrfs
filesystem, and the host kernel is a version prior to 5.16, then don't
call `fsync` before `rename`.  In all other cases, do call `fsync`,
and prevent dependent writes (i.e. the following `rename`) in the event
of errors.
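
The decision above can be sketched as follows.  All function names here are hypothetical, the filesystem check is reduced to a boolean supplied by the caller, and the write path is a plain POSIX sequence rather than the BeesStringFile implementation.

```cpp
#include <fcntl.h>
#include <stdio.h>     // sscanf, rename
#include <string>
#include <unistd.h>

// Hypothetical helper: parse "5.15.0-91-generic" into 5 and 15.
inline bool parse_kernel_release(const char *release, int &kmajor, int &kminor) {
    return sscanf(release, "%d.%d", &kmajor, &kminor) == 2;
}

// Hypothetical helper: skip fsync only on btrfs with a kernel before 5.16.
inline bool should_fsync_before_rename(bool is_btrfs, int kmajor, int kminor) {
    const bool before_5_16 = kmajor < 5 || (kmajor == 5 && kminor < 16);
    return !(is_btrfs && before_5_16);
}

// Hypothetical helper: write `data` to tmp_path, then rename it over
// final_path.  Any write, fsync, or close error prevents the rename, so
// the old file contents survive.
inline bool write_then_rename(const std::string &tmp_path,
                              const std::string &final_path,
                              const std::string &data,
                              bool do_fsync) {
    const int fd = open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return false;
    }
    bool ok = write(fd, data.data(), data.size())
              == static_cast<ssize_t>(data.size());
    if (ok && do_fsync) {
        ok = fsync(fd) == 0;       // last chance to see a write error
    }
    ok = (close(fd) == 0) && ok;
    if (!ok) {
        return false;              // do not perform the dependent rename
    }
    return rename(tmp_path.c_str(), final_path.c_str()) == 0;
}
```

A caller would obtain the release string from `uname(2)` and the filesystem type from `statfs(2)`, then pass the result of `should_fsync_before_rename` as `do_fsync`.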

Since present kernel versions still require `fsync`, we don't need
an upper bound on the kernel version check until someone fixes btrfs
`rename` (or perhaps adds a flag to `renameat2` which prevents use of
the log tree) in the kernel.  Once that fix happens, we can drop the
`fsync` call for kernels after that fixed version.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:04:20 -05:00
Zygo Blaxell
962d94567c hexdump: fix pointer cast const mismatch
Another hit from the exotic compiler collection:  build fails on GCC 9,
from Ubuntu 20...but not later versions of GCC.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-10 21:00:31 -05:00
Zygo Blaxell
6dbef5f27b fs: improve compatibility with linux-libc-dev 5.4
Fix the missing symbols that popped up when adding chunk tree to
lib/fs.cc.  Also define the missing symbols instead of merely trying to
avoid them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-08 21:17:15 -05:00
Zygo Blaxell
88b1e4ca6e main: unconditionally enable workaround for the logical_ino-vs-clone kernel bug
This obviously doesn't fix or prevent the kernel bug, but it does prevent
bees from triggering the bug without assistance from another application.

The bug can still be triggered by running bees at the same time as an
application which uses clone or LOGICAL_INO.  `btdu` uses LOGICAL_INO,
while `cp` from coreutils (and many others) uses clone (reflink copy).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
c1d7fa13a5 roots: drop unnecessary mutex unlock in stop_request
In commit 31b2aa3c0d ("context: speed
up orderly process termination"), the stop request was split into two
methods after the mutex unlock.

Now that there's nothing after the mutex unlock in `stop_request`,
there's no need for an explicit unlock to do what the destructor would
have done anyway.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
aa39bddb2d extent scan: implement an experimental ordered scan mode
Parallel scan runs each extent size tier in a separate thread.  The
threads compete to process extents within the tier's size range.

Ordered scan processes each extent size tier completely before moving on
to the next.  In theory, this means large extents always get processed
quickly, especially when new ones appear, and the queue does not fill up
with small extents.

In practice, the multi-threaded scanner massively outperforms the
single-threaded scanner, unless the number of worker threads is very
small (i.e. one).

Disable most of the feature for now, but leave the code in place so it
can be easily reactivated for future testing.

Ordered scan introduces a parallelized extent mapper Task.  Keep that in
parallel scan mode, which further enhances the parallelism.  The extent
scan crawl threads now run at 'idle' priority while the map tasks run
at normal priority, so the map tasks don't flood the task queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 23:14:16 -05:00
Zygo Blaxell
1aea2d2f96 crawl: deprecate use of BeesCrawl to search the extent tree
BeesScanModeExtent can do that by itself now.  Overloading the subvol
crawl code resulted in an ugly, inefficient hack, and we definitely
don't want to accidentally continue to use it.

Remove the support for reading the extent tree and add some `assert`s
to make sure it isn't still used somewhere.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
673b450671 docs: update event counters after extent scan refactoring and crawl skipping
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
183b6a5361 extent scan: refactor BeesCrawl, BeesScanMode*
The main gains here are:

* Move extent tree searches into BeesScanModeExtent so that they are
not slowed down by the BeesCrawl code, which was designed for the
much more specialized metadata in subvol trees.
* Enable short extent skipping now that BeesCrawl is out of the way.
* Stop enumerating btrfs subvols when in extent scan mode.

All this gets rid of >99% of unnecessary extent tree searches.
Incremental extent scan cycles now finish in milliseconds instead
of minutes.

BeesCrawl was never designed to cope with the structure and content of
the extent tree.  It would waste thousands of tree-search ioctl calls
reading and ignoring metadata items.

Performance was particularly bad when a binary search was involved, as any
binary search probe that landed in a metadata block group would read and
discard all the metadata items in the block group, sequentially, repeated
for each level of the binary search.  This was blocking implementation of
short extent skipping optimization for large extent size tiers, because
the skips were using thousands of tree searches to skip over only a few
hundred extent items.

Extent scan also had to read every extent item twice to do the
transid filtering, because BeesCrawl's interface discarded the relevant
information when it converted a `BtrfsTreeItem` into a `BeesFileRange`.
The cost of this extra fetch was negligible, but it could have been zero.

Fix this by:

* Copy the equivalent of `fetch_extents` from BeesCrawl into
`BeesScanModeExtent`, then give each of the extent scan crawlers its
own `BtrfsDataExtentTreeFetcher` instance.  This enables extent tree
searches to avoid pure (non-mixed) metadata block groups.  `BeesCrawl`
is now used only for its interface to `BeesRoots` for saving state in
`beescrawl.dat`, and never to determine the next extent tree item.

* Move subvol-specific parts of `BeesRoots` into a new class
`BeesScanModeSubvol` so that `BeesScanModeExtent` doesn't have to enable
or support them.  In particular, `bees -m4` no longer enumerates all
of the _subvol_ crawlers.  `BeesRoots` is still used to save and load
crawl state.

* Move several members from `BeesScanModeExtent` into a per-crawler
state object `SizeTier` to eliminate the need for some locks and to
maintain separate cache state for `BtrfsDataExtentTreeFetcher`.

* Reuse the `BtrfsTreeItem` to get the generation field for the transid
range filter.

* Avoid a few corner cases when handling errors, where extent scan might
drop an extent without scanning it, or fail to advance to the next extent.

* Enable the extent-skipping algorithm for large size tiers, now that
`BeesCrawl::fetch_extents` is no longer slowing it down.

* Add a debug stream interface which developers can easily turn on when
needed to inspect the decisions that extent scan is making.

* Track metrics that are more useful, particularly searches per extent
scanned, and fraction of extents that are skipped.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:43:22 -05:00
Zygo Blaxell
b6446d7316 roots: rework open_root_nocache to use btrfs-tree
This gets rid of one open-coded btrfs tree search.

Also reduce the log noise level for subvol open failures, and remove
some ancient references to `BEESLOG`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
d32f31f411 btrfs-tree: harden rlower_bound against exceptional objects
Rearrange the logic in `rlower_bound` so it can cope with a tree
that contains mostly block-aligned objects, with a few exceptions
filtered out by `hdr_stop`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
dd08f6379f btrfs-tree: add a method to get root backref items to BtrfsRootFetcher
This complements the already existing support for reading the fields of
a root backref.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
58ee297cde btrfs-tree: connect methods to the debug stream interface
In some cases functions already had existing debug stream support
which can be redirected to the new interface.  In other cases, new
debug messages are added.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
a3c0ba0d69 fs: add a runtime debug stream for btrfs tree searches
This allows plugging in an ostream at run time so that we can audit all
the search calls we are doing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
75040789c6 btrfs-tree: drop BtrfsFsTreeFetcher and clean up class comments
BtrfsFsTreeFetcher was used for early versions of the extent scanner, but
neither subvol nor extent scan now needs an object that is both persistent
and configured to access only one subvol.  BtrfsExtentDataFetcher does
the same thing in that case.

Clarify the comments on what the remaining classes do, so that
BtrfsFsTreeFetcher doesn't get inadvertently reinvented in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f9a697518d btrfs-tree: introduce BtrfsDataExtentTreeFetcher to read data extents without metadata
Binary searches can be extremely slow if the target bytenr is near a
metadata block group, because metadata items are not visible to the
binary search algorithm.  In a non-mixed-bg filesystem, there can be
hundreds of thousands of metadata items between data extent items, and
since the binary search algorithm can't see them, it will run searches
that iterate over hundreds of thousands of objects about a dozen times.

This is less of a problem for mixed-bg filesystems because the data and
metadata blocks are not isolated from each other.  The binary search
algorithm still can't see the metadata items, but there are usually
some data items close by to prevent the linear item filter from running
too long.

Introduce a new fetcher class (all the good names were taken) that tracks
where the end of the current block group is.  When the end of the current
block group is reached in the linear search, skip ahead to a block group
that can contain data items.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
c4ba6ec269 fs: add a ntoa function for chunk types
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
440740201a main: the base directory for --strip-paths should be root_fd, not cwd
The cwd is where core dumps and various profiling and verification
libraries want to write their data, whereas root_fd is the root of the
target filesystem.  These are often intentionally different.  When
they are different, `--strip-paths` sets the wrong prefix to strip
from paths.

Once the root fd has been established, we can set the path prefix to
the string prefix that we'll get from future calls to `name_fd`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
f6908420ad hash: handle $BEESHOME on non-btrfs
bees explicitly supports storing $BEESHOME on another filesystem, and
does not require that filesystem to be btrfs; however, if $BEESHOME
is on a non-btrfs filesystem, there is an exception on every startup
when trying to identify the subvol root of the hash table file in order
to blacklist it, because non-btrfs filesystems don't have subvol roots.

Fix by checking not only whether $BEESHOME is on btrfs, but also whether
it is on the _same_ btrfs as the bees root, without throwing an exception.
The hash table is blacklisted only when both filesystems are btrfs and
have the same fsid.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
925b12823e fs: add do_ioctl_nothrow and fsid methods to btrfs fs info
Enable use of the ioctl to probe whether two fds refer to the same btrfs,
without throwing an exception.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
561e604edc seeker: turn off debug logging
The debug log is only revealed when something goes wrong, but it is
created and discarded every time `seek_backward` is called, and it
is quite CPU-intensive.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
30cd375d03 readahead: clean up the code, update docs
Remove dubious comments and #if 0 section.  Document new event counters,
and add one for read failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
48b7fbda9c progress: adjust minimum thresholds for ETA to 10 seconds and 1 GiB of data
1% is a lot of data on a petabyte filesystem, and a long time to wait for an
ETA.

After 1 GiB we should have some idea of how fast we're reading the data.
Increase the time to 10 seconds to avoid a nonsense result just after a scan
starts.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-02-06 22:42:15 -05:00
Zygo Blaxell
85aba7b695 openat2: #include <linux/types.h> so we can know __u64
Alternative implementations could use `uint64_t` instead, from `cstdint`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 17:02:19 -05:00
Zygo Blaxell
de38b46dd8 scripts/beesd: harden the mount options
 * `nodev`: This reduces rename attack surface by preventing bees from
 opening any device file on the target filesystem.

 * `noexec`: This prevents access to the mount point from being leveraged
 to execute setuid binaries, or execute anything at all through the
 mount point.

These options are not required because they duplicate features in the
bees binary (assuming that the mount namespace remains private):

 * `noatime`: bees always opens every file with `O_NOATIME`, making
 this option redundant.

 * `nosymfollow`: bees uses `openat2` on kernels 5.6 and later with
 flags that prevent symlink attacks.  `nosymfollow` was introduced in
 kernel 5.10, so every kernel that can do `nosymfollow` can already do
 `openat2`.  Also, historically, `$BEESHOME` can be a relative path with
 symlinks in any path component except the last one, and `nosymfollow`
 doesn't allow that.

Between `openat2` and `nodev`, all symlink attacks are prevented, and
rename attacks cannot be used to force bees to open a device file.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 01:00:41 -05:00
Zygo Blaxell
0abf6ebb3d scripts/beesd: no need for $BEESHOME to be a subvol
We _recommend_ that `$BEESHOME` should be a subvol, and we'll create a
subvol if no directory exists; however, there's no reason to reject an
existing plain directory if the user chooses to use one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-20 00:43:13 -05:00
Kai Krakow
360ce7e125 scripts/beesd: Unshare namespace without systemd
If starting the beesd script without systemd, the mount point won't
automatically unmount if the script is cancelled with ctrl+c.

Fixes: https://github.com/Zygo/bees/issues/281
Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-20 00:05:57 -05:00
Zygo Blaxell
ad11db2ee1 openat2: supply the missing definitions for building with old headers and new kernel
Apparently Ubuntu 20 has upgraded to kernel 5.15, but still builds things
with 5.4 headers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:20:06 -05:00
Zygo Blaxell
874832dc58 openat2: log a warning when we fall back to openat
This should occur only once per run, but it's worth leaving a note
that it has happened.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
Zygo Blaxell
5fe89d85c3 extent scan: make sure we run every extent crawler once per transaction
There's a pathological case where all of the extent scan crawlers except
one are at the end of a crawl cycle, but the one crawler that is still
running is keeping the Task queue full.  The result is that bees never
starts the other extent scan crawlers, because the queue is always
full at the instant a new transid triggers the start of a new scan.
That's bad because it will result in bees falling behind when new data
from the inactive size tiers appears.

To fix this, check for throttling _after_ creating at least one scan task
in each crawler.  That will keep the crawlers running, and possibly allow
them to claw back some space in the Task queue.  It slightly overcommits
the Task queue, so there will be a few more Tasks than nominally allowed.

Also (re)introduce some hysteresis in the queue size limit and reduce it
a little, so that bees isn't continually stopping and restarting crawls
every time one task is created or completed, and so that we stay under
the configured Task limit despite overcommitting.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 22:19:42 -05:00
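
The hysteresis mentioned in commit 5fe89d85c3 can be illustrated with a two-watermark check. This is a minimal sketch, not the bees implementation: the class name `QueueThrottle`, the watermark parameters, and the numbers in the usage note are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of bees): decides whether crawlers should
// stop creating scan Tasks, using two watermarks instead of a single limit.
class QueueThrottle {
public:
    QueueThrottle(std::size_t low_water, std::size_t high_water)
        : m_low(low_water), m_high(high_water)
    {
        assert(low_water < high_water);
    }

    // Returns true while crawlers should be throttled.
    // Throttling starts when the queue reaches the high watermark and
    // stops only after the queue drains to the low watermark, so one
    // Task being created or completed does not flip the state.
    bool update(std::size_t queue_size)
    {
        if (queue_size >= m_high) {
            m_throttled = true;
        } else if (queue_size <= m_low) {
            m_throttled = false;
        }
        return m_throttled;
    }

private:
    std::size_t m_low;
    std::size_t m_high;
    bool m_throttled = false;
};
```

With watermarks of 80 and 100, a queue that reaches 100 Tasks stays throttled at 90 and resumes only at 80 or below. Keeping the high watermark somewhat below the configured limit leaves room for the overcommit the commit message describes.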
Zygo Blaxell
a2b3e1e0c2 log: demote a lot of BEESLOGWARN to higher verbosity levels
Toxic extent workarounds are going away because the underlying kernel
bugs have been fixed.  They are no longer worthy of spamming non-developer
logs.

INO_PATHS can return no paths if an inode has been deleted.  It doesn't
need a log message at all, much less one at WARN level.

Dedupe failure can be INFO, the same level as dedupe itself, especially
since the "NO dedupe" message doesn't mention what was [not] deduped.

Inspired by Kai Krakow's "context: demote "abandoned toxic match" to
debug log level".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-19 01:08:28 -05:00
Kai Krakow
aaec931081 context: demote "abandoned toxic match" to debug log level
This log message floods the system journal, leading to write-back
flushing storms under high activity. As it is a workaround message,
it is probably useful only to developers, so demote it to debug level.

This fixes latency spikes in desktop usage after adding a lot of new
files, especially since systemd-journal starts to flush caches if it
sees memory pressure.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2025-01-19 00:59:22 -05:00
Zygo Blaxell
c53fa04a2f task: fixes for priority and idle Tasks
Tasks are not allowed to be queued more than once, but it is allowed
to queue a Task while it's already running, which means a Task can be
executed on two threads in parallel.  Tasks detect this and handle it
by queueing the Task on its own post-exec queue.  That in turn leads
to Workers which continually execute the same Task if that Task doesn't
create any new Tasks, while other Tasks sit on the Master queue waiting
for a Worker to dequeue them.

For idle Tasks, we don't want the Task to be rescheduled immediately.
We want the idle Task to execute again after every available Task on
both the main and idle queues has been executed.

Fix these by having each Task reschedule itself on the appropriate
queue when it finishes executing.

Priority-queued Tasks should be executed in priority order across not just
one Task's post-exec queue, but the entire local queue of the TaskConsumer.

Fix this by moving the sort into either the TaskConsumer that receives
a post-exec queue, if there is one, or into the Task that is created
to insert the post-exec queue into a TaskConsumer when one becomes
available in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-15 00:43:25 -05:00
Zygo Blaxell
d4a681c8a2 Revert "roots: use a non-idle task for next_transid"
next_transid tasks don't respect queue selection very well, because
they effectively end up spinning in a loop until all other worker
threads become busy.

Back this out, and fix the priority handling in the Task library.

This reverts commit 58db4071de.
2025-01-12 18:48:33 -05:00
Zygo Blaxell
a819d623f7 task: do not allow queue loops in priority queueing mode
Tasks using non-priority FIFO dependency tracking can insert themselves
into their own queue, to run the Task again immediately after it exits.

For priority queues, this attempts to splice the post-exec queue into
itself, which doesn't seem like a good idea.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 15:28:26 -05:00
Zygo Blaxell
de9d72da80 task: flatten queues of dependent Tasks
Suppose Tasks A, B, and C are created in that order, and are currently running.
Task T acquires Exclusion E.  Task B, A, and C attempt to acquire the
same Exclusion, in that order, but fail because Task T holds it.

The result is Task T with a post-exec queue:

        T, [ B, A, C ]  sort_requested

Now suppose Task U acquires Exclusion F, then Task T attempts to acquire
Exclusion F.  Task T fails to acquire F, so T is inserted into U's
post-exec queue.  The result at the end of the execution of T is a tree:

        U, [ T ]  sort_requested
             \-> [ B, A, C ] sort_requested

Task T exits after failing to acquire a lock.  When T exits, T will
sort its post-exec queue and submit the post-exec queue for execution
immediately:

        Worker 1: U, [ T ]  sort_requested
        Worker 2: A, B, C

This isn't ideal because T, A, B, and C all depend on at least one
common Exclusion, so they are likely to immediately conflict with T
when U exits and T runs again.

Ideally, A, B, and C would at least remain in a common queue with T,
and ideally that queue is sorted.

Instead of inserting T into U's post-exec queue, insert T and all
of T's post-exec queue, which creates a single flattened Task list:

        U, [ T, B, A, C ]   sort_requested

Then when U exits, it will sort [ T, B, A, C ] into [ A, B, C, T ],
and run all of the queued Tasks in age priority order:

        U exited, [ T, B, A, C ]   sort_requested

        U exited, [ A, B, C, T ]

        [ A, B, C, T ] on TaskConsumer queue

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 14:05:44 -05:00
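
The flattening described in commit de9d72da80 maps naturally onto `std::list::splice` and `std::list::sort`. The following is a minimal sketch under stated assumptions: the `Task` struct is a simplified stand-in, and `queue_behind` and `release` are hypothetical names, not functions from the bees Task library.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Hypothetical, simplified Task: an ID that increases with creation order,
// a name for display, and a post-exec queue of Tasks waiting on this one.
struct Task {
    unsigned id;
    std::string name;
    std::list<Task> post_exec;
};

// Queue `waiter` behind `holder`.  Instead of leaving waiter's post-exec
// queue nested under waiter, move waiter and everything that was waiting
// on it into holder's queue as one flat list.
void queue_behind(Task &holder, Task waiter)
{
    std::list<Task> followers;
    followers.splice(followers.end(), waiter.post_exec);
    assert(waiter.post_exec.empty());
    holder.post_exec.push_back(std::move(waiter));
    holder.post_exec.splice(holder.post_exec.end(), followers);
}

// When `holder` exits: sort the flat queue oldest-first (lowest ID first)
// and hand the whole list over for execution.
std::list<Task> release(Task &holder)
{
    holder.post_exec.sort(
        [](const Task &a, const Task &b) { return a.id < b.id; });
    std::list<Task> ready;
    ready.splice(ready.end(), holder.post_exec);
    return ready;
}
```

Starting from T with post-exec queue [ B, A, C ], `queue_behind(U, T)` leaves U holding [ T, B, A, C ], and `release(U)` yields [ A, B, C, T ] when A, B, C, and T have increasing IDs. Splicing moves list nodes without copying Tasks, which is one of the `std::list` properties the neighbouring commit 74d8bdd60f mentions.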
Zygo Blaxell
74d8bdd60f task: add an insert method for priority-queueing Tasks by age
Task started out as a self-organizing parallel-make algorithm, but ended
up becoming a half-broken wait-die algorithm.  When a contended object
is already locked, Tasks enter a FIFO queue to restart and acquire the
lock.  This is the "die" part of wait-die (all locks on an Exclusion are
non-blocking, so no Task ever does "wait").  The lock queue is FIFO wrt
_lock acquisition order_, not _Task age_ as required by the wait-die
algorithm.

Make it a 25%-broken wait-die algorithm by sorting the Tasks on lock
queues in order of Task ID, i.e. oldest-first, or FIFO wrt Task age.
This ensures the oldest Task waiting for an object is the one to get
it when it becomes available, as expected from the wait-die algorithm.

This should reduce the amount of time Tasks spend on the execution queue,
and reduce memory usage by avoiding the accumulation of Tasks that cannot
make forward progress.

Note that turning `TaskQueue` into an ordered container would have
undesirable side-effects:

 * `std::list` has some useful properties wrt stability of object
 location and cost of splicing.  Other containers may not have these,
 and `std::list` does have a `sort` method.

 * Some Task objects are created at the beginning and reused continually,
 but we really do want those Tasks to be executed in FIFO order wrt
 submission, not Task ID.  We can exclude these tasks by only doing the
 sorting when a Task is queued for an Exclusion object.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-12 00:35:37 -05:00
Zygo Blaxell
a5d078d48b docs: deprecate the --workaround-btrfs-send option
Emphasize that the option is relevant to old kernels, older than the
minimum supportable version threshold.

De-emphasize the use case of "send-workaround" as a synonym for "exclude
read-only".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
e2587cae9b docs: expand "Threads and load management" to suggest not running bees so much
One of the more obvious ways to reduce bees load is to simply not run
it all the time.  Explicitly state using maintenance windows as a load
management option.

SIGUSR1 and SIGUSR2 should have been documented somewhere else before now.
Better late than never.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
ac581273d3 docs: config.md updates
The theories behind bees slowing down when presented with a larger hash
table turned out to be wrong.  The real cause was a very old bug which
submitted thousands of `LOGICAL_INO` requests when only a handful of
requests were needed.

"Compression on the filesystem" -> "Compression in files"

Don't be so "dramatic".  Be "rapid" instead.

Remove "cannot avoid modifying read-only snapshots" as a distinction
between subvol and extent scans.  Both modes support send workaround
and send waiting with no significant distinction.

Emphasize extent scan's better handling of many snapshots.  Also reflinks.

Add some discussion of `--throttle-factor`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:56 -05:00
Zygo Blaxell
7fcde97b70 docs: update the bug reporting and status instructions
Thread names have changed.  Document some of the newer ones.

Don't jump immediately to blaming poor performance on qgroups or
autodefrag.  These do sometimes have kernel regressions but not all
the time.

Emphasize advantage of controlling bees deferred work requests at the
source, before btrfs gets stuck committing them.

Avoid asserting that it's OK for gdb to crash.

Remove mention of lower-layer block device issues wrt corruption.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
e457f502b7 docs: update kernel bugs page for January 2025
"Kernel" -> "Linux kernel".  If you can run bees on a kernel that isn't
Linux, congratulations!

Emphasize the age of the data corruption warnings.  Once 5.4 reaches
EOL we can remove those.

Simplify the discussion of old kernels and API levels.  There's a
new optional kernel API for `openat2` support at 5.6.  The absolute
minimum kernel version is still 4.2, and will not increase to 4.15
until the subvol scanners are removed.

Remove discussion of bees support for kernels 4.19 (which recently
reached EOL) and earlier.

The `LOGICAL_INO` vs dedupe bug is actually a `LOGICAL_INO` vs clone bug.
Dedupe isn't necessary to reproduce it.

Remove a stray ')'.

Strip out most of the discussion of slow backrefs, as they are no longer a
concern on the range of supported kernel versions.  Leave some description
there because bees still has some vestigial workarounds.

Remove `btrfs send` from the "Unfixed kernel bugs" section, which makes
the section empty, so remove the section too.  bees now handles send on
a subvol reasonably well.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
46815f1a9d docs: update README.md
Emphasize "large" is an upper bound on the size of filesystem bees
can handle.

New strengths:  largest extent first for fixed maintenance windows,
scans data only once (ish), recovers more space

Removed weaknesses:  less temporary space

Need more caps than `CAP_SYS_ADMIN`.

Emphasize DATA CORRUPTION WARNING is an old-kernel thing.

Update copyright year.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
0d251d30f4 docs: update feature interaction lists
Tested on larger filesystems than 100T too, but let's use Fermi
approximation.  Next size is 1P.

Removed interaction with block-level SSD caching subsystems.  These are
really btrfs metadata vs. a lower block layer, and have nothing to do
with bees.

Added mixed block groups to the tested list, as mixed block groups
required explicit support in the extent scanner.

Added btrfs-convert to the tested list.  btrfs-convert has various
problems with space allocation in general, but these can be solved by
carefully ordered balances after conversion, and they have nothing to
do with bees.

In-kernel dedupe is dead and the stubs were removed years ago.  Remove it
from the list.

btrfs send now plays nicely with bees on all supportable kernels, now
that stable/linux-4.19.y is dead.  Send workaround is only needed for
kernels before v5.4 (technically v5.2, but nobody should ever mount a
btrfs with kernel v5.1 to v5.3).  bees will pause automatically when
deduping a subvol that is currently running a send.

bees will no longer gratuitously refragment data that was defragmented
by autodefrag.

Explicitly list all the RAID profiles tested so far, as there have been
some new ones.

Explicitly list other deduplicators tested.

Sort the list of btrfs features alphabetically.

Add scrub and balance, which have been tested with bees since the
beginning.

New tested btrfs features:  block-group-tree, raid1c3, raid1c4.

New untested btrfs features:  squotas, raid-stripe-tree.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
b8dd9a2db0 progress: put a timestamp in the bottom row
This records the time when the progress data was calculated, to help
indicate when the data might be very old.

While we're here, move "now" out of the loop so there's only one value.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
8bc90b743b task: get rid of the insert_task method
Nothing calls it (not even tests), and there's significant functional
overlap with `try_lock`.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-11 23:39:55 -05:00
Zygo Blaxell
2f2a68be3d roots: use openat2 instead of openat when available
This increases resistance to symlink and mount attacks.

Previously, bees could follow a symlink or a mount point in a directory
component of a subvol or file name.  Once the file is opened, the open
file descriptor would be checked to see if its subvol and inode matches
the expected file in the target filesystem.  Files that fail to match
would be immediately closed.

With openat2 resolve flags, symlinks and mount points terminate path
resolution in the kernel.  Paths that lead through symlinks or onto
mount points cannot be opened at all.

Fall back to openat() if openat2() returns ENOSYS, so bees will still
run on kernels before v5.6.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-09 02:26:53 -05:00
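
The openat2-with-fallback pattern from commit 2f2a68be3d can be sketched as follows. This is an illustration, not the bees code: the struct and constants are local copies of the kernel UAPI definitions (so the sketch builds with pre-5.6 headers), the choice of `RESOLVE_NO_SYMLINKS` plus `RESOLVE_NO_XDEV` is an assumption based on the commit message, and `make_how` and `open_restricted` are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Local copy of the kernel's struct open_how (three 64-bit fields).
struct open_how_compat {
    std::uint64_t flags;
    std::uint64_t mode;
    std::uint64_t resolve;
};

constexpr std::uint64_t kResolveNoXdev     = 0x01;  // RESOLVE_NO_XDEV
constexpr std::uint64_t kResolveNoSymlinks = 0x04;  // RESOLVE_NO_SYMLINKS

#ifdef SYS_openat2
constexpr long kSysOpenat2 = SYS_openat2;
#else
constexpr long kSysOpenat2 = 437;  // syscall number on most architectures
#endif

// Build the openat2 argument: same open flags, plus resolve flags that stop
// path resolution at symlinks and mount points.  Mode stays 0, so this
// sketch does not support O_CREAT or O_TMPFILE.
open_how_compat make_how(int flags)
{
    open_how_compat how{};
    how.flags = static_cast<std::uint64_t>(flags);
    how.resolve = kResolveNoXdev | kResolveNoSymlinks;
    return how;
}

// Try openat2 first; fall back to plain openat only when the kernel does
// not implement the syscall (ENOSYS).  Other errors are returned as-is.
int open_restricted(int dirfd, const char *path, int flags)
{
    assert(path != nullptr);
    open_how_compat how = make_how(flags);
    const long rv = syscall(kSysOpenat2, dirfd, path, &how, sizeof(how));
    if (rv >= 0 || errno != ENOSYS) {
        return static_cast<int>(rv);
    }
    return openat(dirfd, path, flags);
}
```

After a fallback, the caller still needs the older post-open check that the descriptor's subvol and inode match the expected file, because plain `openat` will follow symlinks and cross mount points in directory components.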
Zygo Blaxell
82f1fd8054 process: replace crucible::gettid() with a weak symbol
Since we're now using weak symbols for dodgy libc functions, we might
as well do it for gettid() too.

Use the ::gettid() global namespace and let libc override it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-09 01:37:44 -05:00
Zygo Blaxell
a9b07d7684 openat2: create a weak syscall wrapper for it
openat2 allows closing more TOCTOU holes, but we can only use it when
the kernel supports it.

This should disappear seamlessly when libc implements the function.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-09 01:36:39 -05:00
Zygo Blaxell
613ddc3c71 progress: rename "ctime" -> "tm_left"
"ctime", an abbreviation of "cycle time", collides with "ctime", an
abbreviation of "st_ctime", a well-known filesystem term.

"tm_left" fits in the column, so use that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-06 12:50:50 -05:00
Zygo Blaxell
c3a39b7691 progress: rework the progress table after github discussion
* Report position within cycle in units that cannot be mistaken for size or percentage
* Put the total/maximum values in their own row
* Add a start time column
* Change column titles to reference "cycles"
* Use "idle" instead of "finished" when a crawler is not running
* Replace "transid" with "gen" because it's shorter

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:45:37 -05:00
Zygo Blaxell
58db4071de roots: use a non-idle task for next_transid
The scanners which finish early can become stuck behind scanners that are
able to keep the queue full.  Switch the next_transid task to the normal
Task queues so that we force scanners to restart on every new transaction,
possibly deferring already queued work to do so.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:36:53 -05:00
Zygo Blaxell
0d3e13cc5f context: report time in scan_one_extent
Add yet another field to the scan/skip report line:  the wallclock
time used to process the extent ref.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:36:53 -05:00
Zygo Blaxell
1af5fcdf34 roots: don't access a shared variable after releasing a lock
Access the local copy of `m_root_crawl_map` instead.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:36:53 -05:00
Zygo Blaxell
87472b6086 extent scan: don't put non-data block groups in the data extent map
The total data size should not include metadata or system block groups,
and already does not; however, we still have these block groups in the map
for mapping the crawl pointer to a logical offset within the filesystem.

Rearrange a few lines around the `if` statement so that the map doesn't
contain anything it should not.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:32:48 -05:00
Zygo Blaxell
ca351d389f extent scan: pick the right block groups for mixed-bg filesystems
The progress indicator was failing on a mixed-bg filesystem because those
filesystems have block groups which have both _DATA and _METADATA bits,
and the filesystem size calculation was excluding block groups that have
_METADATA set.  It should exclude block groups that have _DATA not set.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
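
The difference between the two block group tests in commit ca351d389f is easiest to see with the flag bits written out. This sketch redefines the three type bits locally with the same values as `BTRFS_BLOCK_GROUP_DATA`, `_SYSTEM`, and `_METADATA` in the kernel headers. The `BlockGroup` struct and the function names are hypothetical, and `holds_data_old` is only a literal reading of the commit message, not the previous bees code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t BG_DATA     = 1ULL << 0;
constexpr std::uint64_t BG_SYSTEM   = 1ULL << 1;
constexpr std::uint64_t BG_METADATA = 1ULL << 2;

// Hypothetical, simplified block group record.
struct BlockGroup {
    std::uint64_t flags;
    std::uint64_t length;
};

// Old test as described: reject anything with the METADATA bit set.
// This wrongly rejects mixed block groups (DATA | METADATA).
bool holds_data_old(std::uint64_t flags)
{
    return (flags & BG_METADATA) == 0;
}

// Fixed test: reject only block groups that do not have the DATA bit set.
bool holds_data(std::uint64_t flags)
{
    return (flags & BG_DATA) != 0;
}

// Sum the sizes of all block groups that can contain file data.
std::uint64_t total_data_bytes(const std::vector<BlockGroup> &groups)
{
    std::uint64_t total = 0;
    for (const BlockGroup &bg : groups) {
        // Every block group is expected to carry at least one type bit.
        assert((bg.flags & (BG_DATA | BG_SYSTEM | BG_METADATA)) != 0);
        if (holds_data(bg.flags)) {
            total += bg.length;
        }
    }
    return total;
}
```

On a mixed-bg filesystem every non-system block group has both bits set, so the old test would count no data at all, which is consistent with the progress indicator failing there.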
Zygo Blaxell
1f0b8c623c options: improve message when too many--or too few--path arguments given
Running bees with no arguments complains about "Only one" path argument.
Replace this with "Exactly one" which uses similar terminology to other
btrfs tools.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
74296c644a options: return EXIT_SUCCESS after displaying help message
`getopt_long` already supplies a message when an option cannot be parsed,
so there isn't a need to distinguish option parse failures from help
requests.

Fixes: https://github.com/Zygo/bees/pull/277
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
231593bfbc throttle: don't hold the multilock during throttle
Release the lock before entering the throttle sleep, so that other
threads can still run.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
d4900cc5d5 docs: default throttle is zero
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
81bbf7e1d4 throttle: set default to 0.0
Longer latency testing runs are not showing a consistent gain from a
throttle factor of 1.0.  Make the default more conservative.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
bd9dc0229b docs: add --throttle-factor option
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:15:37 -05:00
Zygo Blaxell
2a1ed0b455 throttle: track time values more closely
Decaying averages by 10% every 5 minutes gives roughly a half-hour
half-life to the rolling average.  Speed that up to once per minute.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:14:31 -05:00
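
The "roughly a half-hour half-life" figure in commit 2a1ed0b455 follows from solving (1 - d)^n = 1/2 for the number of decay intervals n. The function below is a small worked check of that arithmetic. The name `half_life_minutes` is made up for this illustration, and it assumes the 10% decay factor stays the same when the interval changes.

```cpp
#include <cassert>
#include <cmath>

// Half-life of a value that loses `decay_fraction` of itself once every
// `interval_minutes`.
double half_life_minutes(double decay_fraction, double interval_minutes)
{
    assert(decay_fraction > 0.0 && decay_fraction < 1.0);
    // Solve (1 - decay_fraction)^n = 1/2 for n, the number of intervals.
    const double intervals = std::log(0.5) / std::log(1.0 - decay_fraction);
    return intervals * interval_minutes;
}
```

Under these assumptions `half_life_minutes(0.10, 5.0)` is about 32.9 minutes, and decaying by the same 10% once per minute gives a half-life of about 6.6 minutes.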
Zygo Blaxell
d160edc15a throttle: add --throttle-factor option to control throttling factor
Also change the initializer syntax for the option list to use C99
compound literals.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2025-01-03 23:13:51 -05:00
Zygo Blaxell
e79b242ce2 options: clean up the parser, prepare for new options with no short form
We're not adding any more short options, but the debugging code doesn't
work with optvals above 255.  Also clean up constness and variable
lifetimes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-16 23:32:18 -05:00
Zygo Blaxell
ea45982293 throttle: add delays to match deferred request rate to btrfs completion rate
Measure the time spent running various operations that extend btrfs
transaction completion times (`LOGICAL_INO`, tmpfiles, and dedupe)
and arrange for each operation to run for not less than the average
amount of time by adding a sleep after each operation that takes less
than the average.

The delay after each operation is intended to slow down the rate of
deferred and long-running requests from bees to match the rate at which
btrfs is actually completing them.  This may help avoid big spikes in
latency if btrfs has so many requests queued that it has to force a
commit to release memory.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-16 23:32:18 -05:00
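
The padding rule in commit ea45982293 (each operation should occupy at least the average duration) reduces to a one-line calculation. This is a minimal sketch: `padding_delay` and `run_throttled` are hypothetical names, how the running average is maintained is left to the caller, and the later `--throttle-factor` scaling is not modelled.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Seconds = std::chrono::duration<double>;

// Extra sleep needed so that an operation which took `elapsed` occupies at
// least `average` in total.  Operations slower than average get no delay.
Seconds padding_delay(Seconds elapsed, Seconds average)
{
    assert(elapsed.count() >= 0.0 && average.count() >= 0.0);
    return elapsed < average ? average - elapsed : Seconds::zero();
}

// Run `op`, then sleep for the padding delay.  Returns the measured run
// time so the caller can fold it into its running average.
template <class Op>
Seconds run_throttled(Op &&op, Seconds average)
{
    const auto start = std::chrono::steady_clock::now();
    op();
    const Seconds elapsed = std::chrono::steady_clock::now() - start;
    std::this_thread::sleep_for(padding_delay(elapsed, average));
    return elapsed;
}
```

For example, with an average of 0.5 seconds, an operation that finishes in 0.2 seconds is followed by a 0.3 second sleep, while one that takes 0.7 seconds is not delayed at all.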
Zygo Blaxell
f209cafcd8 bees: bump the file limits again, 512k files and 64k dirs
Test machines keep blowing past the 32k file limit.  16 worker
threads at 10,000 files each is much larger than 32k.

Other high-FD-count services like DNS servers ask for million-file
rlimits.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-16 22:54:12 -05:00
Zygo Blaxell
c4b31bdd5c extent scan: no need for "No ref for extent" debug message
While a snapshot is being deleted, there will be a continuous stream of
"No ref for extent" messages.  This is a common event that does not need
to be reported.

There is an analogous situation when a call to open() fails with ENOENT.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-14 15:02:39 -05:00
Zygo Blaxell
08fe145988 context: wait for btrfs send to finish, then try dedupe again
Dedupe is not possible on a subvol where a btrfs send is running:

    BTRFS warning (device dm-22): cannot deduplicate to root 259417 while send operations are using it (1 in progress)

btrfs informs a process with EAGAIN that a dedupe could not be performed
due to a running send operation.

It would be possible to save the crawler state at the affected point,
fork a new crawler that avoids the subvol under send, and resume the
crawler state after a successful dedupe is detected; however, this only
helps the intersection of the set of users who have unrelated subvols
that don't share extents, and the set of users who cannot simply delay
dedupe until send is finished.  The simplest approach is to stop
and wait until the send goes away.

That approach is taken here.  When a dedupe fails with EAGAIN,
affected Tasks will poll, approximately once per transaction, until the
dedupe succeeds or fails with a different error.

bees dedupe performance corresponds with the availability of subvols that
can accept dedupe requests.  While the dedupe is paused, no new Tasks can
be performed by the worker thread.  If subvols are small and isolated
from the bulk of the filesystem data, the result will be a small but
partial loss of dedupe performance during the send as some worker threads
get stuck on the sending subvol.  If subvols heavily share extents with
duplicate data in other subvols, worker threads will all become blocked,
and the entire bees process will pause until at least some of the running
sends terminate.

During the polling for btrfs send, the dedupe Task will hold its dst
file open.  This open FD won't interfere with snapshot or file delete
because send subvols are always read-only (it is not possible to delete
a file on a RO subvol, open or otherwise) and send itself holds the
affected subvol open, preventing its deletion.  Once the send terminates,
the dedupe will terminate soon after, and the normal FD release can occur.

This pausing during btrfs send is unrelated to the
`--workaround-btrfs-send` option, although `--workaround-btrfs-send` will
cause the pausing to trigger less often.  It applies to all scan modes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-14 14:51:28 -05:00
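
The polling behaviour described in commit 08fe145988 amounts to a retry loop keyed on EAGAIN. The sketch below shows only that control flow. The function name and both callbacks are hypothetical: `try_dedupe` stands in for the dedupe call and returns 0 or an errno value, and `wait_for_next_transaction` stands in for whatever delay gives "approximately once per transaction".

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Keep retrying the dedupe while btrfs reports EAGAIN (a send is running
// on the destination subvol).  Any other result, success or failure, ends
// the loop and is returned to the caller.
int retry_while_send_running(const std::function<int()> &try_dedupe,
                             const std::function<void()> &wait_for_next_transaction)
{
    assert(try_dedupe && wait_for_next_transaction);
    for (;;) {
        const int err = try_dedupe();
        if (err != EAGAIN) {
            return err;
        }
        wait_for_next_transaction();
    }
}
```

Because the loop runs inside the Task, the worker thread and the open dst file descriptor stay tied up until the send ends, which is the trade-off the commit message spells out.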
Zygo Blaxell
bb09b1ab0e roots: drop method transid_re
There are no callers of this method any more, and it exposes more
of BeesRoots than we really want things to have access to.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-13 23:19:43 -05:00
Zygo Blaxell
94d9945d04 roots: move the transid cache update into transid_max_nocache()
All callers of the `transid_max_nocache` method update `m_transid_re`
with the return value, so do that in `transid_max_nocache` itself.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-13 23:19:43 -05:00
Zygo Blaxell
a02588b16f time: add more methods to support dynamic rate throttling
 * Allow RateLimiter to change rate after construction.
 * Check range of rate argument in constructor.
 * Atomic increment for RateEstimator.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
21cedfb13e bytevector: rename the argument to operator[] to be more descriptive
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
b9abcceacb progress: move the "finished" tag to a column where it won't obscure data
The "done" pointer and the "%done" fields are still useful because they
indicate _actual_ progress, not the work that has been _promised_.
So it is possible for a crawl to be "finished" (all extents queued)
but not "100.0000%" (some of those extents still active or in the queue).

"deferred" state isn't particularly useful, so drop it.

"finished" state implies no ETA, so that column is unused.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
31f3a8d67d progress: relabel the inaccurate ETA column
ETA is calculated using a sample obtained by snooping on bees's normal
crawling operations.

This sample is heavily biased and not representative of the entire
filesystem.  If the distribution of extent sizes in the filesystem is
not uniform, the ETA can be wildly wrong.

Collecting an accurate sample set would require extra IO and CPU time
which should be spent doing dedupes instead.

Explicitly label the ETA as inaccurate to avoid having too many users
report the same bug.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
9beb602b16 task: ignore paused status while calculating dynamic thread count
bees might be unpaused at any time, so make sure that the dynamic load
calculation is ready with a non-zero thread count.

This avoids a delay of up to 5 seconds when responding to SIGUSR2
when loadavg tracking is enabled.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:10:15 -05:00
Zygo Blaxell
0580c10082 main: add support for pause (SIGUSR1) and resume (SIGUSR2)
These are simple on/off switches for the task queue.  They are lightweight
requests for bees to be paused temporarily, but allow bees to release
open files and save progress while paused.

These signals are an alternative to SIGSTOP and SIGCONT, or using the
cgroup freezer's FROZEN and THAWED states, which pause and resume the
bees process, but do not allow the bees process to release open files
or save progress.  Snapshot and file deletes can occur on the filesystem
while bees is paused by SIGUSR1 but not by SIGSTOP.

These signals are also an alternative to SIGTERM and restart, which
flush out the whole hash table and progress state on exit, and read
the whole table back into memory on restart.

This feature is experimental and may be replaced by a more general
configuration or runtime control mechanism in the future.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 23:01:19 -05:00
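
A pause/resume switch driven by SIGUSR1 and SIGUSR2, as in commit 0580c10082, can be sketched with a flag set from signal handlers. This is an assumption-laden illustration: the commit message does not say how bees receives the signals, and the names `install_pause_handlers` and `is_paused` are made up here.

```cpp
#include <cassert>
#include <csignal>

// Flag written only from the signal handlers; sig_atomic_t is the type the
// standard guarantees can be safely written inside a handler.
static volatile std::sig_atomic_t g_paused = 0;

static void on_pause(int)  { g_paused = 1; }  // SIGUSR1
static void on_resume(int) { g_paused = 0; }  // SIGUSR2

// Install both handlers.  The process keeps running while paused, so it
// can still close files and save progress.
void install_pause_handlers()
{
    const auto prev1 = std::signal(SIGUSR1, on_pause);
    const auto prev2 = std::signal(SIGUSR2, on_resume);
    assert(prev1 != SIG_ERR && prev2 != SIG_ERR);
    (void)prev1;
    (void)prev2;
}

// Worker loops would check this between Tasks and stop dequeueing new
// work while it returns true.
bool is_paused()
{
    return g_paused != 0;
}
```

From a shell, `kill -USR1 <pid>` would pause and `kill -USR2 <pid>` would resume a process using this scheme. Unlike SIGSTOP, the process itself stays runnable.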
Zygo Blaxell
1cbc894e6f task: start up more worker threads when unpausing
When paused, TaskConsumer threads will eventually notice the paused
condition and exit; however, there's nothing to restart threads when
exiting the paused state.

When unpausing, and while the lock is already held, create TaskConsumer
threads as needed to reach the target thread count.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 22:53:00 -05:00
Zygo Blaxell
d74862f1fc fs: set the correct nr_items to 0 in the ENOENT search case
Commit 72c3bf8438 ("fs: handle ENOENT
within lib") was meant to prevent exceptions when a subvol is deleted.

If the search ioctl fails, the kernel won't set nr_items in the
ioctl output, which means `nr_items` still has the input value.  When
ENOENT is detected, `this->nr_items` is set to 0, then later `*this =
ioctl_ptr->key` overwrites `this->nr_items` with the original requested
number of items.

This replaced the ENOENT exception with an exception triggered by
interpreting garbage in the memory buffer.  The number of exceptions
was reduced because the memory buffers are frequently reused, but upper
layers would then reject the data or ignore it because it didn't match
the key range.

Fix by setting `ioctl_ptr->key.nr_items`, which then overwrites
`this->nr_items`, so the loop that extracts items from the ioctl data
gets the right number of items (i.e. zero).

Fixes: 72c3bf8438 ("fs: handle ENOENT within lib")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-12 22:48:15 -05:00
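
The ordering bug fixed in commit d74862f1fc can be reproduced in miniature. The types below are hypothetical stand-ins, not the bees classes: `SearchKey` models the key fields, `IoctlArgs` models the ioctl buffer, and the two methods model the code before and after the fix, with the ioctl result passed in as an errno value.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical stand-ins for the search key and the ioctl argument buffer.
struct SearchKey {
    std::uint32_t nr_items = 0;
};

struct IoctlArgs {
    SearchKey key;  // on input: requested item count; on success: items found
};

struct Search {
    SearchKey key;  // plays the role of `*this` in the commit message

    // Before the fix: the zero is written to our own copy, then the copy
    // from the ioctl buffer overwrites it with the requested count.
    void finish_buggy(const IoctlArgs &args, int err)
    {
        assert(err >= 0);
        if (err == ENOENT) {
            key.nr_items = 0;
        }
        key = args.key;
    }

    // After the fix: the zero is written to the ioctl buffer, so the copy
    // carries it into our own key.
    void finish_fixed(IoctlArgs &args, int err)
    {
        assert(err >= 0);
        if (err == ENOENT) {
            args.key.nr_items = 0;
        }
        key = args.key;
    }
};
```

With a request for 100 items that fails with ENOENT, the buggy version reports 100 items in a buffer the kernel never filled, while the fixed version reports zero.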
Zygo Blaxell
e40339856f readahead: use the right parameter order when checking the range
In some cases the offset and size arguments were flipped when checking to
see if a range had already been read.  This would have been OK as long as
the same mistake had been made consistently, since `bees_readahead_check`
only does a cache lookup on the parameters, it doesn't try to use them to
read a file.  Alas, there was one case where the correct order was used,
albeit a relatively rare one.

Fix all the calls to use the correct order.

Also fix a comment:  the recent request cache is global to all threads.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-04 11:17:44 -05:00
Zygo Blaxell
1dd96f20c6 fs: drop extra declaration of hexdump
hexdump was moved into a template in its own header years ago, but
the declaration of the implementation that used to be in fs.cc remains.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-04 11:17:44 -05:00
Zygo Blaxell
cd7a71aba3 hexdump: be a little more lock-friendly
hexdump processes a vector as a contiguous sequence of bytes, regardless
of V's value type, so hexdump should get a pointer and use uint8_t to
read the data.

Some vector types have a lock and some atomics in their operator[], so
let's avoid hammering those.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-03 23:39:33 -05:00
Zygo Blaxell
e99a505b3b bytevector: don't deadlock on operator<<
operator<< was a friend function that locked the ByteVector, then invoked
hexdump on the bytevector, which used ByteVector::operator[]...which
locked the ByteVector, resulting in a deadlock.

operator<< shouldn't be a friend anyway.  Make hexdump use the
normal public access methods for ByteVector.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-03 23:39:33 -05:00
Zygo Blaxell
3e89fe34ed roots: avoid copying a BtrfsIoctlSearchKey
Although all the members of BtrfsExtentDataFetcher are theoretically
copiable, there's no need to actually make any such copy.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-03 16:54:14 -05:00
Zygo Blaxell
dc74766179 context: spell "progress" correctly
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-02 09:50:28 -05:00
Zygo Blaxell
3a33a5386b context: add a PROGRESS: header in $BEESSTATUS
Make it clearer where the progress information goes.

Also add placeholder text so the progress section isn't empty at startup,
when the progress hasn't been calculated yet.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 11:41:59 -05:00
Zygo Blaxell
69e9bdfb0f docs: post-5.7 toxic extent handling
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:52 -05:00
Zygo Blaxell
7a197e2f33 bees: post-kernel-5.7 toxic extent handling
Toxic extents are mostly gone in kernel 5.7 and later.  Increase the
timeout for toxic extent handling to reduce false positives, and remove
persistently stored toxic hashes from the hash table.

Toxic hashes are still stored nonpersistently to help mitigate problems
due to any remaining kernel bugs.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:52 -05:00
Zygo Blaxell
43d38ca536 extent scan: don't serialize dedupe and LOGICAL_INO when using extent scan mode
The serialization doesn't seem to be necessary for the extent scan mode.
No infinite loops in the kernel have been observed in the past two years,
despite never having used MultiLock for the extent scanner.

Leave the serialization for now on the subvol scanners.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:52 -05:00
Zygo Blaxell
7b0ed6a411 docs: default scan mode is 4, "extent"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
8d4d153d1d main: set default scan mode to mode 4 (EXTENT)
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
d5a6c30623 docs: old missing features are not missing any more
The extent scan mode has been implemented (partially, but close enough
to win benchmarks).

New features include several nuisance dedupe countermeasures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
25f7ced27b docs: add scan mode 4, "extent"
Extent is a different kind of scan mode, so introduce the concept of
the two kinds of scan mode, and rearrange the description of scan modes
along the new boundaries.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
c1af219246 progress: squeeze the progress table into 80 columns or less
We don't need the subvol numbers since they're only interesting to
developers.

We don't need both max and min sizes, pick one and drop the other.

Replace "16E" with "max"--it is the same number of characters, but
doesn't require the user to know what 1<<64 is off the top of their head.

Shorten "remain" to "todo" because sometimes those extra two columns
matter.

Drop the seconds field in ETA timestamps.  Long scan arrival times are
years away, and short scan arrival times are only updated once every
5 minutes, so the extra precision isn't useful.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
9c183c2c22 progress: put the progress table in the stats and status files
Make the progress information more accessible, without having to
enable full debug log and fish it out of the stream with grep.

Also increase the progress log level to INFO.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
59f8a467c3 extent scan: fix crawl_map creation
There are two crawl_maps in extent scan's next_transid:  one gets
initialized, the other gets used.  This works OK as long as bees is
resuming an existing scan, because the two maps are identical; however,
it fails if bees is starting without an existing set of crawl data,
and one of the two maps is empty or partially filled.

The failure is intermittent, as the crawl map is being populated at
the same time next_transid runs.  It will eventually be completed after
several transaction cycles, at which point bees runs normally.
It does add significant delays during startup for benchmarks.

There's only one crawl_map in extent scan, it always has the same
crawlers, and extent scan's `next_transid` creates it by itself.
Ignore the map from BeesRoots/BeesCrawl.

Also throw in some missing but helpful trace statements.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
9987aa8583 progress: estimate actual data sizes for progress report
Replace pointers in the "done" and "total" columns with estimated data
sizes for each size tier.  The estimation is based on statistics
collected from extents scanned during the current bees run.

Move the total size for the entire filesystem up to the heading.

Report the _completed_ position (i.e. the one that would be saved in
`beescrawl.dat`), not the _queued_ position (i.e. the one where the
next Task would be created in memory).

At the end of the data, the crawl pointer ends up at some random point
in the filesystem just after the newest extent, so the progress gets to
99.7% and then goes to some random value like 47% or 3%, not to 100%.
Report "deferred" in the "done" column when the crawler is waiting for
the next transid, and "finished" in the "%done" column when the crawler
has reached the end of the data.  Suppress the ETA when finished.  This
makes it clear that there's no further work to do for these crawlers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
da32667e02 docs: add event counters for extent scan
Add a section for all the new extent scan event counters.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
8080abac97 extent scan: refactor BeesScanMode so derived classes decide their own scan scheduling
BeesScanModeExtent uses six scan Tasks instead of one, which leads
to awkwardness like the do_scan method to tell crawl_roots how to do
what it shouldn't need to know how to do anyway.

Move the crawl_roots logic into the ::scan methods themselves.

This also deletes the very popular "crawl_more ran out of data" message.
Extent scan explicitly indicates when a scan is complete, so there's
no longer a need to fish this message out of the log.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
1e139d0ccc extent scan: put all the refs in a single Task, sort them, use idle task
The sorting avoids problematic read orders, like extent refs in the same
inode with descending offsets, that btrfs is not optimized for.

Putting everything in one Task keeps the queue sizes small, and
manages the lock contention much more calmly.

We only want to be mapping extent refs if there aren't enough extents
already in the queue to keep worker threads busy, so use the `idle()`
method instead of `run()`.
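
The sort order can be sketched roughly as follows.  `ExtentRefSketch` and
`sort_refs` are hypothetical names, and the real comparison in bees may use
different fields:

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical stand-in for an extent reference (not the bees type).
struct ExtentRefSketch {
    uint64_t root;    // subvol tree id
    uint64_t inode;   // inode number within the subvol
    uint64_t offset;  // byte offset within the file
};

// Group refs by subvol and inode, with ascending offsets inside each inode,
// so that no file is read back-to-front.
inline void sort_refs(std::vector<ExtentRefSketch> &refs)
{
    std::sort(refs.begin(), refs.end(),
        [](const ExtentRefSketch &a, const ExtentRefSketch &b) {
            return std::tie(a.root, a.inode, a.offset)
                 < std::tie(b.root, b.inode, b.offset);
        });
}
```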

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
6542917ffa extent scan: introduce SCAN_MODE_EXTENT
The EXTENT scan mode reads the extent tree, splits it into tiers by
extent size, converts each tier's extents into subvol/inode/offset refs,
then runs the legacy bees dedupe engine on the refs.

The extent scan mode can cheaply compute completion percentage and ETA,
so do that every time a new transid is observed.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-12-01 00:17:51 -05:00
Zygo Blaxell
b99d80b40f task: add an idle queue
Add a second level queue which is only serviced when the local and global
queues are empty.

At some point there might be a need to implement a full priority queue,
but for now two classes are sufficient.
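
A single-threaded sketch of the two-class scheduling rule.  The class and
method names are illustrative only, and the sketch folds the local and global
queues into one "normal" queue:

```cpp
#include <deque>
#include <functional>
#include <utility>

// Hypothetical two-class work queue (no locking, no worker threads).
class TwoLevelQueueSketch {
    std::deque<std::function<void()>> m_normal;
    std::deque<std::function<void()>> m_idle;
public:
    void run(std::function<void()> f)  { m_normal.push_back(std::move(f)); }
    void idle(std::function<void()> f) { m_idle.push_back(std::move(f)); }

    // Execute one item.  The idle queue is only consulted when the
    // normal queue is empty.  Returns false when there is nothing to do.
    bool step()
    {
        auto &q = !m_normal.empty() ? m_normal : m_idle;
        if (q.empty()) return false;
        auto f = std::move(q.front());
        q.pop_front();
        f();
        return true;
    }
};
```

An idle item queued first still runs last, as long as normal items keep
arriving before it reaches the front.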

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
099ad2ce7c fs: add some performance metrics for TREE_SEARCH_V2 calls
These give some visibility into how efficiently bees is using the
TREE_SEARCH_V2 ioctl.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
a59a02174f table: add a simple text table renderer
This should help clean up some of the uglier status outputs.

Supports:

 * multi-line table cells
 * character fills
 * sparse tables
 * insert, delete by row and column
 * vertical separators

and not much else.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
e22653e2c6 docs: remove "matched_" prefix event counters
We can no longer reliably determine the number of hash table matches,
since we'll stop counting after the first one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
44810d6df8 scan_one_extent: remove the unreadahead after benchmark results
That unreadahead used to result in a 10% hit on benchmarks.  Now it's
closer to 75%.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
8f92b1dacc BeesRangePair: drop the _really_ expensive toxic extent workaround
We were doing a `LOGICAL_INO` ioctl on every _block_ of a matching extent,
just to see how long it takes.  It takes a while!

This could be modified to do an ioctl with the `IGNORE_OFFSET` flag,
once per new extent, but the kernel bug was fixed a long time ago, so
we can start removing all the toxic extent code.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
0b974b5485 scan_one_extent: in skip/scan lines, log whether extent is compressed
Useful for debugging the compressed-zero-block cases.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
ce0367dafe scan_one_extent: reduce the number of LOGICAL_INO calls before finding a duplicate block range
When we have multiple possible matches for a block, we proceed in three
phases:

1.  retrieve each match's extent refs and put them in a list,
2.  iterate over the list converting viable block matches into range matches,
3.  sort and flatten the list of range matches into a non-overlapping
list of ranges that cover all duplicate blocks exactly once.

The separation of phase 1 and 2 creates a performance issue when there
are many block matches in phase 1, and all the range matches in phase
2 are the same length.  Even though we might quickly find the longest
possible matching range early in phase 2, we first extract all of the
extent refs from every possible matching block in phase 1, even though
most of those refs will never be used.

Fix this by moving the extent ref retrieval in phase 1 into a single
loop in phase 2, and stop looping over matching blocks as soon as any
dedupe range is created.  This avoids iterating over a large list of
blocks with expensive `LOGICAL_INO` ioctls in an attempt to improve the
match when there is no hope of improvement, e.g. when all match ranges
are 4K and the content is extremely prevalent in the data.

If we find a matched block that is part of a short matching range,
we can replace it with a block that is part of a long matching range,
because there is a good chance we will find a matching hash block in
the long range by looking up hashes after the end of the short range.
In that case, overlapping dedupe ranges covering both blocks in the
target extent will be inserted into the dedupe list, and the longest
matches will be selected at phase 3.  This usually provides a similar
result to that of the loop in phase 1, but _much_ more efficiently.

Some operations are left in phase 1, but they are all using internal
functions, not ioctls.
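
Phase 3 can be illustrated with a small sweep over sorted ranges.  This is
one possible way to flatten overlaps while preferring the longer range, not
necessarily the algorithm bees uses; `RangeSketch` and `flatten_ranges` are
hypothetical names:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RangeSketch {
    uint64_t begin, end;   // half-open byte range [begin, end)
    uint64_t size() const { return end - begin; }
};

// Return non-overlapping ranges covering the same bytes exactly once.
std::vector<RangeSketch> flatten_ranges(std::vector<RangeSketch> in)
{
    std::sort(in.begin(), in.end(), [](const RangeSketch &a, const RangeSketch &b) {
        return a.begin != b.begin ? a.begin < b.begin : a.end > b.end;
    });
    std::vector<RangeSketch> out;
    for (RangeSketch r : in) {
        if (r.begin >= r.end) continue;              // ignore empty ranges
        if (out.empty() || r.begin >= out.back().end) {
            out.push_back(r);                        // no overlap
        } else if (r.end <= out.back().end) {
            continue;                                // already covered
        } else if (r.size() > out.back().size()) {
            // r is longer: keep it whole and trim what came before it
            while (!out.empty() && out.back().begin >= r.begin) out.pop_back();
            if (!out.empty() && out.back().end > r.begin) out.back().end = r.begin;
            out.push_back(r);
        } else {
            r.begin = out.back().end;                // r is shorter: trim its front
            out.push_back(r);
        }
    }
    return out;
}
```

Trimming the front or back of a range here plays the same role as shrinking
one side of an overlapping dedupe.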

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
54ed6e1cff docs: event counter updates after fixing counter names and scan_one_extent improvements
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
24b08ef7b7 scan_one_extent: eliminate nuisance dedupes, drop caches after reading data
A laundry list of problems fixed:

 * Track which physical blocks have been read recently without making
 any changes, and don't read them again.

 * Separate dedupe, split, and hole-punching operations into distinct
 planning and execution phases.

 * Keep the longest dedupe from overlapping dedupe matches, and flatten
 them into non-overlapping operations.

 * Don't scan extents that have blocks already in the hash table.
 We can't (yet) touch such an extent without making unreachable space.
 Let them go.

 * Give better information in the scan summary visualization:  show dedupe
 range start and end points (<ddd>), matching blocks (=), copy blocks
 (+), zero blocks (0), inserted blocks (.), unresolved match blocks
 (M), should-have-been-inserted-but-for-some-reason-wasn't blocks (i),
 and there's-a-bug-we-didn't-do-this-one blocks (#).

 * Drop cached data from extents that have been inserted into the hash
 table without modification.

 * Rewrite the hole punching for uncompressed extents, which apparently
 hasn't worked properly since the beginning.

Nuisance dedupe elimination:

 * Don't do more than 100 dedupe, copy, or hole-punch operations per
 extent ref.

 * Don't split an extent or punch a hole unless dedupe would save at
 least half of the extent ref's size.

 * Write a "skip:" summary showing the planned work when nuisance
 dedupe elimination decides to skip an extent.
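
The nuisance dedupe rules above can be sketched as a single predicate.
`PlanSketch` and its fields are hypothetical; only the 100-operation limit
and the "at least half" threshold come from this commit:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical summary of the work planned for one extent ref.
struct PlanSketch {
    size_t   op_count;     // dedupe + copy + hole-punch operations planned
    uint64_t ref_bytes;    // size of the extent ref
    uint64_t saved_bytes;  // bytes that dedupe would free
    bool     needs_split;  // plan splits the extent or punches a hole
};

constexpr size_t MAX_OPS_PER_REF_SKETCH = 100;

// True when the planned work should be skipped as a nuisance dedupe.
inline bool is_nuisance(const PlanSketch &p)
{
    if (p.op_count > MAX_OPS_PER_REF_SKETCH) return true;
    // Splitting or hole-punching must save at least half of the ref's size.
    if (p.needs_split && p.saved_bytes * 2 < p.ref_bytes) return true;
    return false;
}
```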

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
97eab9655c types: add shrink_begin and shrink_end methods for BeesFileRange and BeesRangePair
These allow trimming of overlapping dedupes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
05bf1ebf76 counters: fix counter names for scan_eof, scan_no_fd, scanf_deferred_inode
This code gets moved around from time to time and ends up with the
wrong prefix.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
606ac01d56 multilock: allow turning it off
Add a master switch to turn off the entire MultiLock infrastructure for
testing, without having to remove and add all the individual entry points.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
72c3bf8438 fs: handle ENOENT within lib
This prevents the storms of exceptions that occur when a subvol is
deleted.  We simply treat the entire tree as if it was empty.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
72958a5e47 btrfs-tree: accessors for TreeFetcher classes' type and tree values
Sometimes we have a generic TreeFetcher and we need to know which tree
it came from.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
f25b4c81ba btrfs-tree: add root refs and extent flags fields
Lazily filling in accessor methods for btrfs objects as needed by bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
a64603568b task: fix try_lock argument description
try_lock allows specification of a different Task to be run instead of
the current Task when the lock is busy.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
33cde5de97 bees: increase file cache size limits
With some extents having 9999 refs, we can use much larger caches for
file descriptors.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
5414c7344f docs: resolve_overflow limit is only 655050 when BTRFS_MAX_EXTENT_REF_COUNT is
Use the current header value in the doc.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
8bac00433d bees: reduce extent ref limit to 9999
Originally the limit was 2730 (64KiB worth of ref pointers).  This limit
was a little too low for some common workloads, so it was then raised by
a factor of 256 to 699050, but there are a lot of problems with extent
counts that large.  Most of those problems are memory usage and speed
problems, but some of them trigger subtle kernel MM issues.

699050 references is too many to be practical.  Set the limit to 9999,
only 3-4x larger than the original 2730, to give up on deduplication
when each deduped ref reduces the amount of space by no more than 0.01%.
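
The numbers can be checked with a little arithmetic.  The sketch assumes each
ref pointer is three 64-bit values (inode, offset, root), i.e. 24 bytes, and
ignores the small buffer header; neither detail is stated in this commit:

```cpp
#include <cstdint>

// Assumption: one ref is three 64-bit values (inode, offset, root).
constexpr uint64_t REF_BYTES = 3 * sizeof(uint64_t);

constexpr uint64_t refs_in_buffer(uint64_t buffer_bytes)
{
    return buffer_bytes / REF_BYTES;
}

static_assert(refs_in_buffer(64 * 1024) == 2730, "64 KiB of ref pointers");
static_assert(refs_in_buffer(16 * 1024 * 1024) == 699050, "256 times larger buffer");

// With N refs sharing one extent, the last ref accounts for at most 1/N
// of the space, expressed here as a percentage.
constexpr double marginal_saving_percent(uint64_t ref_count)
{
    return 100.0 / static_cast<double>(ref_count);
}
```

`marginal_saving_percent(9999)` is about 0.01, which is where the 0.01%
figure comes from.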

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
088cbc951a docs: event counter updates after readahead sanity improvements
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
e78e05e212 readahead: inject more sanity at the foundation of an insane architecture
This solves a third bad problem with bees reads:

3.  The architecture above the read operations will issue read requests
for the same physical blocks over and over in a short period of time.

Fixing that properly requires rewriting the upper-level code, but a
simple small table of recent read requests can reduce the effect of the
problem by orders of magnitude.
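
A rough sketch of such a table, assuming requests are keyed on an exact
(offset, size) pair and evicted oldest-first.  `RecentReadsSketch` is a
hypothetical class, not the bees implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical small table of recent read requests, shared by all threads.
class RecentReadsSketch {
    std::mutex m_mutex;
    std::deque<std::pair<uint64_t, uint64_t>> m_recent;  // (offset, size)
    size_t m_limit;
public:
    explicit RecentReadsSketch(size_t limit) : m_limit(limit) {}

    // Returns true if this exact request was seen recently;
    // otherwise records it and returns false.
    bool check(uint64_t offset, uint64_t size)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        const auto key = std::make_pair(offset, size);
        for (const auto &r : m_recent) {
            if (r == key) return true;
        }
        m_recent.push_back(key);
        if (m_recent.size() > m_limit) m_recent.pop_front();
        return false;
    }
};
```

Because the lookup is an exact match on both values, a caller that swaps
offset and size never hits an entry recorded in the correct order.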

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
8d08a3c06f readahead: inject some sanity at the foundation of an insane architecture
This solves some of the worst problems with bees reads:

1.  The kernel readahead doesn't work.  More precisely, it's much better
adapted for a very different use case:  a single thread alternating
between reading a file sequentially and processing the data that was read.
bees has multiple threads which compete for access to IO and then issue
reads in random order immediately after the call to readahead.  The kernel
uses idle ioprio scheduling for the readaheads, so the readaheads get
preempted by the random reads, or the kernel cancels the readaheads because
the data access pattern isn't sequential after the readahead was issued.

2.  Seeking drives perform terribly with multiple competing readers,
especially with btrfs striped profiles where the iops are broken into
tiny stripe-sized pieces.  At one point I intended to read the btrfs
device map and figure out which devices can be read in parallel, but to
make that useful, the user needs to have an array with multiple drives
in single profile, or 4+ drives in raid1 profile.  In all other cases,
the elaborate calculations always return the same result:  there can be
only one reader at a time.

This commit fixes both problems:

1.  Don't use the kernel readahead.  Use normal reads into a dummy
buffer instead.

2.  Allow only one thread to readahead at any time.  Once the read is
completed, the data is in the page cache, and all the random-order small
reads that bees does will hit the page cache, not a spinning disk.
In some cases we need to read two things close together, so add a
`bees_readahead_pair` which holds one lock across both reads.
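
A minimal sketch of both fixes together, assuming a 128 KiB scratch buffer
(an arbitrary choice) and a hypothetical function name.  It is not the
`bees_readahead` implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical sketch: pull a byte range into the page cache with plain
// reads, one reader at a time.  Returns the number of bytes read.
inline uint64_t readahead_sketch(int fd, off_t offset, size_t size)
{
    static std::mutex s_one_reader;       // only one thread reads at a time
    std::lock_guard<std::mutex> lock(s_one_reader);
    static char s_dummy[128 * 1024];      // contents are discarded
    uint64_t total = 0;
    while (size > 0) {
        const size_t want = std::min(size, sizeof(s_dummy));
        const ssize_t got = pread(fd, s_dummy, want, offset);
        if (got <= 0) break;              // EOF or error: stop quietly
        total += static_cast<uint64_t>(got);
        offset += got;
        size -= static_cast<size_t>(got);
    }
    return total;
}
```

The data is thrown away on purpose; the only goal is that later small reads
find the pages already cached.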

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
cdcdf8e218 hash: use kernel readahead instead of bees_readahead to prefetch hash table
The hash table is read sequentially and from a single thread, so
the kernel's implementation of readahead is appropriate here.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
37f5b1bfa8 docs: add allocator regression in 6.0+ kernels
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
abe2afaeb2 context: when a task fails to acquire an extent lock, don't go ahead and scan the extent anyway
Commit c3b664fea5 ("context: don't forget
to retry locked extents") removed the critical return that prevents a
Task from processing an extent that is locked.

Put the return back.

Fixes: c3b664fea5 ("context: don't forget to retry locked extents")
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
792fdbbb13 fs: get rid of 16 MiB limit on dedupe requests
The kernel has not required a 16 MiB limit on dedupe requests since
v4.18-rc1 b67287682688 ("Btrfs: dedupe_file_range ioctl: remove 16MiB
restriction").

Kernels before v4.18 would truncate the request and return the size
actually deduped in `bytes_deduped`.  Kernel v4.18 and later will loop
in the kernel until the entire request is satisfied (although still
in 16 MiB chunks, so larger extents will be split).

Modify the loop in userspace to measure the size the kernel actually
deduped, instead of assuming the kernel will only accept 16 MiB.
On current kernels this will always loop exactly once.

Since we now rely on `bytes_deduped`, make sure it has a sane value.
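
The userspace loop can be sketched like this.  `DedupeFnSketch` stands in for
the real ioctl call and is purely hypothetical; the point is that the loop
advances by whatever the kernel reports instead of a fixed 16 MiB:

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>

// Stand-in for the dedupe ioctl: asked to dedupe `length` bytes at `offset`,
// returns how many bytes were actually deduped (like `bytes_deduped`).
using DedupeFnSketch = std::function<uint64_t(uint64_t offset, uint64_t length)>;

// Dedupe the whole range; returns the number of calls that were needed.
inline unsigned dedupe_all(const DedupeFnSketch &dedupe_once,
                           uint64_t offset, uint64_t length)
{
    unsigned calls = 0;
    while (length > 0) {
        const uint64_t done = dedupe_once(offset, length);
        ++calls;
        // Sanity check: progress must be nonzero and no larger than requested.
        if (done == 0 || done > length) {
            throw std::runtime_error("implausible bytes_deduped");
        }
        offset += done;
        length -= done;
    }
    return calls;
}
```

With a stand-in that truncates to 16 MiB, a 40 MiB request takes three calls;
with one that accepts the full request, it takes exactly one.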

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
30a4fb52cb Revert "context: add experimental code for avoiding tiny extents"
because this problem is better solved elsewhere.

This reverts commit 11fabd66a8.
2024-11-30 23:30:33 -05:00
Zygo Blaxell
90d7075358 usage: the default scan mode is 3 (recent)
The code and docs were changed some time ago, but not the usage message.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
faac895568 docs: add the 6.10..6.12 delayed refs bug
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
a7baa565e4 crawl: rename next_transid() to avoid confusion with BeesScanMode::next_transid()
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:30:33 -05:00
Zygo Blaxell
b408eac98e trace: add file and line numbers all the way up the stack
These were added to crucible all the way back in 2018 (1beb61fb78
"crucible: error: record location of exception in what() message")
but it's even more useful in the stack tracer in bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-11-30 23:27:24 -05:00
Zygo Blaxell
75131f396f context: reduce the size of LOGICAL_INO buffers
Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent
references by definition, it follows that we should not allocate buffer
space for them when we perform the LOGICAL_INO ioctl.

There is some evidence (particularly
https://github.com/Zygo/bees/issues/260#issuecomment-1627598058) that
the kernel is subjecting the page cache to a lot of disruption when
trying to allocate large buffers for LOGICAL_INO.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-04-17 23:14:35 -04:00
Zygo Blaxell
cfb7592859 usage: the default scan mode is 1 (independent)
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-04-17 23:14:35 -04:00
Zygo Blaxell
3839690ba3 lib: fix btrfs_data_container pointer casts for 32-bit userspace on 64-bit kernels
Apparently reinterpret_cast<uint64_t> sign-extends 32-bit pointers.
This is OK when running on a 32-bit kernel that will truncate the pointer
to 32 bits, but when running on a 64-bit kernel, the extra bits are
interpreted as part of the (now very invalid) address.

Use <uintptr_t> instead, which is unsigned, integer, and the same word
size as the arch's pointer type.  Ordinary numeric conversion can take
it from there, filling the rest of the word with zeros.
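
The conversion can be sketched as a small helper (`ptr_to_u64` is a
hypothetical name):

```cpp
#include <cstdint>

// Convert a pointer to the 64-bit integer field the kernel expects.
// uintptr_t is unsigned and pointer-sized, so widening it to uint64_t
// fills the upper bits with zeros on a 32-bit build.
inline uint64_t ptr_to_u64(const void *p)
{
    return static_cast<uint64_t>(reinterpret_cast<uintptr_t>(p));
}
```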

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2024-04-17 23:07:41 -04:00
Zygo Blaxell
124507232f docs: add vmalloc bug to kernel bugs list
The bug is:

	v6.3-rc6: f349b15e183d mm: vmalloc: avoid warn_alloc noise caused by fatal signal

The fixes are:

	v6.4: 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
	v6.3.10: c189994b5dd3 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails

The bug has been backported to LTS, but the fix has not:

	v6.2.11: 61334bc29781 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
	v6.1.24: ef6bd8f64ce0 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
	v5.15.107: a184df0de132 mm: vmalloc: avoid warn_alloc noise caused by fatal signal

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 13:50:12 -04:00
Zygo Blaxell
3c5e13c885 context: log when LOGICAL_INO returns 0 refs
There was a bug in kernel 6.3 where LOGICAL_INO with IGNORE_OFFSET
sometimes fails to ignore the offset.  That bug is now fixed, but
LOGICAL_INO still returns 0 refs much more often than seems appropriate.

This is most likely because bees frequently deletes extents while there
is still work waiting for them in Task queues.  In this case, LOGICAL_INO
correctly returns an empty list, because every reference to some extent
is deleted, but the new extent tree with that extent removed is not yet
committed in btrfs.

Add a DEBUG-level log message and an event counter to track these events.
In the absence of a kernel bug, the debug message may indicate CPU time
was wasted performing a search whose outcome could have been predicted.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 12:54:33 -04:00
Zygo Blaxell
a6ca2fa2f6 docs: add IGNORE_OFFSET regression in 6.2..6.3 to kernel bugs list
This doesn't impact the current bees master, but it does break bees-next.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 12:49:36 -04:00
Zygo Blaxell
3f23a0c73f context: downgrade toxic extent workaround message
Toxic extents are much less of a problem now than they were in kernels
before 5.7.  Downgrade the log message level to reflect their lesser
importance.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-07-06 12:49:36 -04:00
Zygo Blaxell
d6732c58e2 test: GCC 13 fix for limits.cc
GCC complains that #include <cstdint> is missing, so add that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-05-07 21:24:21 -04:00
Zygo Blaxell
75b2067cef btrfs-tree: fix build on clang++16
The "loops" variable isn't read (only set) if not built with extra
debug code.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-05-07 21:23:27 -04:00
Zygo Blaxell
da3ef216b1 docs: working around btrfs send issues isn't really a feature
The critical kernel bugs in send have been fixed for years.
The limitations that remain aren't bugs, and bees has no sustainable
workaround for them.

Also update copyright year range.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-03-07 10:25:51 -05:00
Zygo Blaxell
b7665d49d9 docs: fill in missing LTS backports for "1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-03-07 10:17:44 -05:00
Zygo Blaxell
717bdf5eb5 roots: make sure transid_max's computed value isn't max
We check the result of transid_max_nocache(), but not the result of
transid_max().  The latter is a computed result that is even more likely
to be wrong[citation needed].

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:45:29 -05:00
Zygo Blaxell
9b60f2b94d docs: add "missing" features that have been in development for some time already
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:42:42 -05:00
Zygo Blaxell
8978d63e75 docs: update GCC versions list and clarify markdown statement
I don't know if anyone else is testing GCC versions before 8.0 any more,
but I'm not.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:39:55 -05:00
Zygo Blaxell
82474b4ef4 docs: update front page
At least one user was significantly confused by "designed for large
filesystems".

The btrfs send workarounds aren't new any more.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:38:50 -05:00
Zygo Blaxell
73834beb5a docs: minor changes to how-it-works based on past user questions
Clarify that "too large" and "too small" are some distance away from each other.
The Goldilocks zone is _wide_.

The interval between cache drops is now shorter.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:37:37 -05:00
Zygo Blaxell
c92ba117d8 docs: various gotcha updates
Fixing the obviously wrong and out of date stuff.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:37:23 -05:00
Zygo Blaxell
c354e77634 docs: simplify the exit-with-SIGTERM description
The description now matches the code again.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:36:44 -05:00
Zygo Blaxell
f21569e88c docs: update the feature interactions page
Fixing the obviously out-of-date and no-longer-tested things.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:34:22 -05:00
Zygo Blaxell
3d5ebe4d40 docs: update kernel bugs and workarounds list for 6.2.0
Remove some of the repetition to make the document easier to edit.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-25 03:32:52 -05:00
Zygo Blaxell
3430f16998 context: create a Pool of BtrfsIoctlLogicalInoArgs objects
Each object contains a 16 MiB buffer, which is very heavy for some
malloc implementations.

Keep the objects in a Pool so that their buffers are only allocated and
deallocated once in the process lifetime.
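
A sketch of the pooling idea, not the crucible Pool interface.  `PoolSketch`,
`get`, `put`, and `created` are illustrative names:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical object pool: objects (and their large buffers) are built
// once and handed back for reuse instead of being freed.
template <class T>
class PoolSketch {
    std::mutex m_mutex;
    std::vector<std::unique_ptr<T>> m_free;
    size_t m_created = 0;
public:
    // Take an object from the pool, constructing one only if none is free.
    std::unique_ptr<T> get()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_free.empty()) {
            ++m_created;
            return std::unique_ptr<T>(new T());
        }
        std::unique_ptr<T> rv = std::move(m_free.back());
        m_free.pop_back();
        return rv;
    }
    // Hand the object back so its buffer can be reused.
    void put(std::unique_ptr<T> obj)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_free.push_back(std::move(obj));
    }
    // Number of objects ever constructed (read without locking in this sketch).
    size_t created() const { return m_created; }
};
```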

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-23 22:45:31 -05:00
Zygo Blaxell
7c764a73c8 fs: allow BtrfsIoctlLogicalInoArgs to be reused, remove virtual methods
Some malloc implementations will try to mmap() and munmap() large buffers
every time they are used, causing a severe loss of performance.

Nothing ever overrode the virtual methods, and there was no virtual
destructor, so they cause compiler warnings at build time when used with
a template that tries to delete pointers to them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-23 22:40:12 -05:00
Zygo Blaxell
a9a5cd03a5 ProgressTracker: reduce memory usage with long-running work items
ProgressTracker was only freeing memory for work items when they reach
the head of the work tracking queue.  If the first work item takes
hours to complete, and thousands of items are processed every second,
this leads to millions of completed items tracked in memory at a time,
wasting gigabytes of system RAM.

Rewrite ProgressHolderState methods to keep only incomplete work items
in memory, regardless of the order in which they are added or removed.
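
The bookkeeping can be sketched as a set that holds only unfinished items.
This is an illustration under assumed semantics (positions are plain
integers, `begin()` reports the lowest unfinished position), not the
ProgressTracker interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical sketch: memory use is proportional to the number of
// unfinished items, not to the number of items finished out of order.
class ProgressSketch {
    std::multiset<uint64_t> m_incomplete;  // positions of work still running
    uint64_t m_highest = 0;                // highest position ever started
public:
    void start(uint64_t pos)
    {
        m_incomplete.insert(pos);
        m_highest = std::max(m_highest, pos);
    }
    void finish(uint64_t pos)
    {
        const auto it = m_incomplete.find(pos);
        if (it != m_incomplete.end()) m_incomplete.erase(it);  // freed at once
    }
    // Every started item below this position has finished.
    uint64_t begin() const
    {
        return m_incomplete.empty() ? m_highest : *m_incomplete.begin();
    }
    size_t tracked() const { return m_incomplete.size(); }
};
```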

Also fix the unit tests which were relying on the memory leak to work,
and add test cases for code coverage.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-23 22:33:35 -05:00
Zygo Blaxell
299509ce32 seeker: fix the test for ILP32 platforms
Not sure what I was thinking, but the argument here should clearly
be uint64_t.

Fixes: https://github.com/Zygo/bees/issues/248
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-20 11:30:56 -05:00
Zygo Blaxell
d5a99c2f5e roots: don't share a RootFetcher between threads
If the send workaround is enabled, it is possible for two threads (a
thread running the crawl_new task, and a thread attempting to apply the
send workaround) to access the same RootFetcher object at the same time.
That never ends well.

Give each function its own BtrfsRootFetcher object.

Fixes: https://github.com/Zygo/bees/issues/250
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-02-20 11:14:34 -05:00
Kai Krakow
fd6c3b3769 Makefile: also drop fiemap and fiewalk from main Makefile
Fixes: ccd8dcd43f
Signed-off-by: Kai Krakow <kai@kaishome.de>
2023-01-28 11:21:51 +01:00
Zygo Blaxell
849c071146 hash: flush the table more slowly
With SIGTERM and fast exit, the trickle writeback is less important.
We don't want to flood people's IO subsystems with continuous writes.
This really should be configurable at runtime.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
85ff543695 test: simplify Makefile
Make can build dependencies in parallel, so let Make do that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
8147f80a5a src: bees-version.cc cleanups
Do rebuild bees-version.cc if libcrucible changes.
Don't rebuild bees-version.cc if it doesn't change.
Also use the standard suffix for new files.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
cbde237f79 src: simplify Makefile
Make can build dependencies in parallel, so let Make do that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
3b85fc8bc7 lib: drop version.cc entirely
crucible::VERSION doesn't make much sense now that libcrucible no
longer exists as a shared library.  Nothing ever referenced it, so
it can go away.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
4df1b2c834 lib: simplify dependency generation
We don't need to run all the dependencies first, Make can do those in parallel.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
495218104a fd: FS_IOC_SETFLAGS takes an int* argument not a long*
According to ioctl_iflags(2):

	The type of the argument given to the FS_IOC_GETFLAGS and
	FS_IOC_SETFLAGS  operations is int *, notwithstanding the
	implication in the kernel source file include/uapi/linux/fs.h
	that the argument is long *.

So this code doesn't work on be64 machines.

Also, Valgrind complains about it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
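For commit 495218104a above, a minimal sketch of inode flag access with the argument type that ioctl_iflags(2) documents. The helper names `get_inode_flags` and `set_inode_flags` are hypothetical and are not the functions in crucible's fd code. The point is that the variable handed to the ioctl is an `int`: on a 64-bit big-endian machine a `long` would receive the flag bits in its high-order half.

```cpp
#include <cassert>
#include <linux/fs.h>
#include <sys/ioctl.h>

// Hypothetical helper: read the inode flags of an open file.
// Returns false if the ioctl fails (errno is left set by ioctl).
bool get_inode_flags(int fd, int &flags_out)
{
    int flags = 0;  // int, not long, per ioctl_iflags(2)
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) != 0) {
        return false;
    }
    flags_out = flags;
    return true;
}

// Hypothetical helper: replace the inode flags of an open file.
bool set_inode_flags(int fd, int flags)
{
    return ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0;
}
```

A caller that wants to change one flag would read the current value, set or clear the bit (for example `FS_NOCOW_FL`), and write the result back.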
Zygo Blaxell
e82ce3c06e fd: pwrite returns ssize_t not int
A subtle distinction, and not one that is particularly relevant to bees,
but it does make toolchains complain.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
bd336e81a6 fs: get rid of base class btrfs_ioctl_logical_ino_args
Another instance of the pattern where we derived a crucible class
from a btrfs struct.  Make it an automatic variable instead.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
ea17c89165 fs: remove duplicate BTRFS_COMPRESS_ definitions
This was fixed in

	7f660f50b lib: fs: stop using libbtrfs-dev helper functions to re-enable buffer length checks

but apparently some copies live on.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-27 22:16:02 -05:00
Zygo Blaxell
ccd8dcd43f fiemap, fiewalk: drop dead example/test code
These tools are obsolete.  fiemap was a thin wrapper around FIEMAP,
but FIEMAP is not useful on btrfs.  fiewalk was a thin wrapper around
BtrfsExtentWalker, but development on BtrfsExtentWalker has been
abandoned.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-23 00:09:26 -05:00
Zygo Blaxell
facf4121a6 context: remove the one call to operator vector<> method in BtrfsIoctlLogicalInoArgs
There's only one user of this method.  Open-code it so we can kill the
method in libcrucible.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-23 00:09:26 -05:00
Zygo Blaxell
cbc76a7457 hash: don't spin when writes fail
When a hash table write fails, we skip over the write throttling because
we didn't report that we successfully wrote an extent.  This can be bad
if the filesystem is full and the allocations for writes are burning a
lot of CPU time searching for free space.

We also don't retry the write later on since we assume the extent is
clean after a write attempt whether it was successful or not, so the
extent might not be written out later when writes are possible again.

Check whether a hash extent is dirty, and always throttle after
attempting the write.

If a write fails, leave the extent dirty so we attempt to write it out
the next time flush cycles through the hash table.  During shutdown
this will reattempt each failing write once; after that, the updated hash
table data will be dropped.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-23 00:09:26 -05:00
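The flush behavior described in commit cbc76a7457 above might look roughly like this. It is a sketch under assumptions: `flush_pass`, the `dirty` vector, and the two callbacks are hypothetical stand-ins for the bees hash table code, which is not shown in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of one flush pass over the hash table extents.
// dirty[i] != 0 means extent i has changes that are not on disk yet.
// Returns the number of extents written successfully.
std::size_t flush_pass(std::vector<char> &dirty,
                       const std::function<bool(std::size_t)> &write_extent,
                       const std::function<void()> &throttle)
{
    assert(write_extent && throttle);
    std::size_t written = 0;
    for (std::size_t i = 0; i < dirty.size(); ++i) {
        if (!dirty[i]) continue;      // clean extents are skipped
        if (write_extent(i)) {
            dirty[i] = 0;             // only a successful write cleans the extent
            ++written;
        }
        // Throttle after every attempt, successful or not, so a full
        // filesystem cannot turn this loop into a busy spin.
        throttle();
    }
    return written;
}
```

An extent whose write failed stays dirty, so the next pass over the table picks it up again.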
Zygo Blaxell
28ee2ae1a8 docs: fix broken link in options.md
Links in docs/ are relative to docs/, not the top level.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-23 00:08:54 -05:00
Zygo Blaxell
d27621b779 main: catch exceptions and exit gracefully
Calling 'bees -m4' should not call 'std::terminate()', but it does.

Use catch_all instead.  It will still pass the exit value to return
from main.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
cb2c20ccc9 fs: get rid of base class btrfs_ioctl_same_extent_info
We only use BtrfsExtentInfo when it's exactly equivalent to the
base, so drop the derived class.

While we're here, fix BtrfsExtentSame::add so it uses a btrfs-compatible
uint64_t instead of an off_t.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
ded5bf0148 btrfs-tree: fix whitespace and const
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
d5de012a17 btrfs-tree: translate item types for error messages
Look up the name when filling in the what() field for the exception.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
66d1e8a89b btrfs-tree: add chunk items: length and type
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
c327e0bb10 readahead: report the original size in BEESTOOLONG
BEESTOOLONG was always reporting a size of zero, and the offset of the
end of the readahead region.  Report the original size instead (and also
in BEESTRACE and BEESNOTE).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
9587c40677 docs: add crawl_again, drop crawl_restart
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
a115587fad roots: fix extent lock failure handling
Drop the crawl_restart counter, it doesn't happen here (or anywhere else).

Add the crawl_again counter for extents that are restarted due to an
extent-level lock.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
af6ecbc69b trace: use pthread_setname wrapper
libcrucible can deal with the Linux kernel and/or libc's thread name
limitations.  No need to duplicate that work in bees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Zygo Blaxell
563e584da4 task: use pthread_setname_np correctly
It turns out I've been using pthread_setname_np wrong the whole time:

 * on Linux, the thread name length is 15 characters.
   TASK_COMM_LEN is 16 bytes, and the last one is always 0.
   This is now hardcoded in many places and cannot be changed.

 * pthread_setname_np doesn't return -errno, so DIE_IF_MINUS_ERRNO
   was the wrong macro.  On the other hand, we never want to do anything
   differently when pthread_setname_np fails, so we never needed to
   check the return value.

Also, libc silently ignores attempts to set the thread name when it is too
long.  That's almost certainly a libc bug, but libc probably suppresses
the error result for the same reasons I ignore the error result.

Wrap the pthread_setname function with a C++ std::string overload that
truncates the argument at 15 characters, so we at least get the first
part of the task name in the thread name field.  Later commits can deal
with making the bees thread names shorter.

Also wrap pthread_getname for symmetry.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
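A sketch of the kind of wrapper commit 563e584da4 above describes, assuming Linux with glibc. The function names `set_thread_name` and `get_thread_name` are hypothetical; the crucible wrappers may differ in name and signature.

```cpp
#include <cassert>
#include <pthread.h>
#include <string>

// Hypothetical wrapper: name the calling thread, keeping only the
// first 15 characters (TASK_COMM_LEN is 16 including the trailing 0).
void set_thread_name(const std::string &name)
{
    const std::string truncated = name.substr(0, 15);
    // The return value is ignored on purpose: nothing useful can be
    // done differently when naming a thread fails.
    pthread_setname_np(pthread_self(), truncated.c_str());
}

// Hypothetical wrapper: read back the calling thread's name.
std::string get_thread_name()
{
    char buf[16] = {};
    pthread_getname_np(pthread_self(), buf, sizeof(buf));
    return buf;
}
```

Truncating before the call means a long task name still shows its first part in tools like `top` or `ps`, instead of leaving the thread unnamed.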
Zygo Blaxell
c5889049f0 docs: remove duplicate (and wrong) default scan mode
The default scan mode is found in config.md.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2023-01-05 01:10:17 -05:00
Adam Faiz
ecaed09128 docs: fix reference direction
The Dependencies list is above the Packaging section, not below.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-29 06:25:33 -05:00
Zygo Blaxell
64dab81e42 Merge github PR #148
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-23 00:26:33 -05:00
Zygo Blaxell
cfcdac110b context: don't count MultiLock waiting time in dedup_ms
This was inflating the dedup_ms statistic because it was counting all
the resolve time too.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-22 23:46:36 -05:00
Zygo Blaxell
c3b664fea5 context: don't forget to retry locked extents
The caller of scan_forward has to stop advancing the BeesFileCrawl
position when an extent lock blocks a scan, so that it will resume
from the same position when the Task is scheduled again; otherwise,
bees simply skips over the extent and leaves it incompletely deduped.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-22 23:46:36 -05:00
Hilton Chain
66b00f8a97 beesd: Honor DESTDIR on installation.
Co-authored-by: Adam Faiz <adam.faiz@disroot.org>
Signed-off-by: Hilton Chain <hako@ultrarare.space>
2022-12-23 11:10:17 +08:00
Zygo Blaxell
bbcfd9daa6 roots: replace BEES_TRANSID_FACTOR with BEES_TRANSID_POLL_INTERVAL
Restart crawl_more (and update crawl roots and flush FD caches) every
time the transid changes, and only when the transid changes, but
not more often than a reasonable minimum poll interval.

Clean up the log message:  use the proper thread name and remove
the wildly inaccurate estimate of when crawl will resume.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
Zygo Blaxell
d6d3e1045e context: keep the resolve cache smaller
We don't need to cache 65536 extent maps, especially if each one
can have almost 700K references.

Valgrind's massif tool points to the extent map cache as a very
large memory allocator, but test runs with memcg disagree.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
Zygo Blaxell
d5d17cbe62 roots: run insert_new_crawl from within a Task
If we have loadavg targeting enabled, there may be no worker threads
available to respond to new subvols, so we should not bother updating
the subvols list.

Put insert_new_crawl into a Task so it only executes when a worker
is available.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
Zygo Blaxell
48dd2a45fe docs: remove the line discussing 'max_transid' in recent scan mode
This makes the doc match the code again.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
Zygo Blaxell
7267707687 roots: disable recent sorting by max_transid
On large filesystems where the min_transid of all subvols gets stuck at 0,
bees may lose the ability to effectively track recent data.  A secondary sort
by max_transid will allow scanning newer subvols that were created after bees
started running on the filesystem, but before bees completed the first scan
of all subvols.

On the other hand, the secondary sort does a reverse version of the
sequential scan mode, and the sequential scan mode is simply awful.

Disable it for now.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
Zygo Blaxell
984ceeb2a5 docs: update documentation for new 'recent' scan mode
Also attempted to clarify the descriptions of the modes based on
feedback and questions from users over the years.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
Zygo Blaxell
03f809bf22 roots: reimplement scan modes using virtual base and methods
Split each scan mode into two distinct phases:

    1.  A heavy discovery phase, where we search the entire filesystem
    for something (new items in subvol trees in this case).

    2.  A light consuming phase, where we fetch extents to dedupe
    from places that we found in the discovery phase.

Part 1 recomputes the subvol ordering every time there is a new transid.
For some scan modes this computation is quite expensive, far too costly
to pay for every extent, so we do it no more than once per transaction.

Part 2 is run every time a worker thread hits the crawl_more Task.
It simply pulls one extent from the first crawler off a sorted list,
removing the crawler from the list when the crawler runs out of data.

Part 1 creates a new structure and swaps it into place, while Part 2
continues to run using the previous structure.  Neither of these
needs to block the other, so they don't.

The separate class and base pointer also make it easier to add new scan
modes that are not based on subvol trees or that don't use BeesCrawl.

While we're here, fix up some method visibility in BeesRoots.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
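The "create a new structure and swap it into place" arrangement from commit 03f809bf22 above can be sketched like this. The names `CrawlOrder` and `CrawlOrderHolder` are hypothetical, and the ordering is reduced to a list of subvol ids; the actual scan mode classes in bees carry more state than this.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical: subvol ids, best candidate first.
using CrawlOrder = std::vector<uint64_t>;

class CrawlOrderHolder {
    mutable std::mutex m_mutex;
    std::shared_ptr<const CrawlOrder> m_order = std::make_shared<CrawlOrder>();
public:
    // Part 1 (discovery): the expensive ordering was computed by the
    // caller; the lock is held only for the pointer swap.
    void publish(CrawlOrder fresh) {
        std::shared_ptr<const CrawlOrder> next =
            std::make_shared<CrawlOrder>(std::move(fresh));
        std::lock_guard<std::mutex> guard(m_mutex);
        m_order.swap(next);
    }
    // Part 2 (consumer): take a reference to whatever is current.
    // A snapshot taken earlier stays valid after a later publish().
    std::shared_ptr<const CrawlOrder> snapshot() const {
        std::lock_guard<std::mutex> guard(m_mutex);
        return m_order;
    }
};
```

A worker that took a snapshot before a new transid arrived keeps using the old ordering until it asks again, so neither side waits on the other beyond the pointer copy.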
Zygo Blaxell
0dca6f74b0 roots: remove duplicate default scan mode setting
Set the constructor's default scan mode to an invalid mode, so if we
change the default, we don't have to update two places.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:01 -05:00
Zygo Blaxell
f5c4714a28 roots: add 'recent' crawl mode for a mix of new and old data
Crawl mode 3 'recent' prioritizes data from new updates to previously
scanned subvols over subvols that have not been completely scanned yet.
If no such new data exists, falls back to a variation of 'lockstep'
scan mode.

This enables us to keep up with new data as it arrives, a key weakness
of all the other scan modes, and worth violating our unwritten "no new
scan modes until we have extent-tree dedupe working" policy for.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
de96a38460 roots: emit "crawl finished" at the correct time
The correct time is when we set the deferred bit after a tree
search returns empty.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
82c2b5bafe roots: improve thread status tracking messages
Don't dereference a shared_ptr inside a thread status function.

Do trace the crawl start events.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
d725f3c66c context: process PREALLOC extents synchronously in extent's Task worker
Inode-oriented scan workers must do all of their work sequentially,
so it's counterproductive to spawn a Task to do a background dedupe.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
84f91af503 context: don't let multiple worker Tasks get stuck on a single extent or inode
When two Tasks attempt to lock the same extent, append the later Task
to the earlier Task's post-exec work queue.  This will guarantee that
all Tasks which attempt to manipulate the same extent will execute
sequentially, and free up threads to process other extents.

Similarly, if two scanner threads operate on the same inode, any dedupe
they perform will lock out other scanner threads in btrfs.  Avoid this
by serializing Task objects that reference the same file.

This does theoretically use an unbounded amount of memory, but in practice
a Task that encounters a contended extent or inode quickly stops spawning
new Tasks that might increase the queue size, and all Tasks that might
contend for the same lock(s) end up on a single FIFO queue.

Note that the scope of inode locks is intentionally global, i.e. when
an inode is locked, it locks every inode with the same number in every
subvol.  This avoids significant lock contention and task queue growth
when the same inode with the same file extents appear in snapshots.

Fixes: https://github.com/Zygo/bees/issues/158
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
31d26bcfc6 roots: organize scan workers by inode instead of extent
Split crawlers into two separate Tasks:

 1. a Task which locates the next inode with a new data extent.

 2. a Task which scans every new extent in that inode.

This simplifies some lock contention and execution ordering issues.
Files are read sequentially.  Workers dynamically scale up or
down as needed, without creating thousands of deferred Task objects.
Workers obtain inode locks for different inodes in btrfs, so they
can work in parallel instead of waiting for each other.

This change in behavior comes with new names for the worker Tasks:

        "crawl_master" is now "crawl_more", the singular Task which
        creates inode-scanning Tasks.

        "crawl_<subvol>" is now "crawl_<subvol>_<inode>".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
e13c62084b roots: use scan mode 'independent' by default
Independent subvol scanners fairly consistently outperform either
of the correlated scan modes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
7cef1133be roots: use symbolic names for SCAN_MODEs
This was done on the development branch three years ago, and
has been creating annoying merge conflicts ever since.  Sync
up the branches so they have the same names for these.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:51:00 -05:00
Zygo Blaxell
f98599407f roots: rework btrfs send workaround using btrfs-tree
Drop the cache since we no longer have to open a file every time we
check a subvol's status.

Also stop counting workaround events at the root level twice.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:59 -05:00
Zygo Blaxell
4d59939b07 btrfs-tree: introduce lightweight classes for btrfs tree search operations
btrfs-tree provides classes for low-level access to btrfs tree objects.

An item class is provided to decode polymorphic btrfs item fields.

Several tree classes provide forward and backward iteration over raw
object items at different tree levels.

A csum tree class provides convenient access to csums by bytenr,
supporting all current btrfs csum types.

Wrapper classes for inode and subvol items provide direct access to
btrfs metadata fields without clumsy stat() wrappers or ioctls.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:59 -05:00
Zygo Blaxell
24b904f002 seeker: backward searching template function
This template turns a forward search primitive (e.g. lower_bound, FIEMAP,
TREE_SEARCH_V2) into a backward search primitive.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:59 -05:00
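To illustrate the idea in commit 24b904f002 above, here is one simple way to build a backward search from a forward one. It is not the crucible seeker template: `seek_backward` and `ForwardSearch` are hypothetical names, and the forward primitive is assumed to behave like `lower_bound` (return the first item at or after a position, or nothing). The sketch steps backward in doubling strides until a forward search lands at or before the target, then bisects to find the last such item.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical forward primitive: first item >= pos, or nullopt.
using ForwardSearch = std::function<std::optional<uint64_t>(uint64_t)>;

// Returns the largest item <= target, or nullopt if there is none.
std::optional<uint64_t> seek_backward(uint64_t target, const ForwardSearch &forward)
{
    assert(forward);
    // Phase 1: back off in doubling steps until a forward search
    // finds an item that is not past the target.
    uint64_t probe = target;
    uint64_t step = 1;
    std::optional<uint64_t> found;
    for (;;) {
        found = forward(probe);
        if (found && *found <= target) break;
        if (probe == 0) return std::nullopt;
        probe = probe > step ? probe - step : 0;
        step *= 2;
    }
    // Phase 2: low is a known item and the answer lies in [low, high].
    uint64_t low = *found;
    uint64_t high = target;
    while (low < high) {
        const uint64_t mid = low + (high - low + 1) / 2;
        const auto hit = forward(mid);
        if (hit && *hit <= high) {
            low = *hit;       // a later item exists, move up to it
        } else {
            high = mid - 1;   // nothing in [mid, high]
        }
    }
    return low;
}
```

Both phases need a number of forward searches that grows with the logarithm of the distance searched, which matters when each forward search is an ioctl.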
Zygo Blaxell
23c16aa978 BeesFileRange: coalesce is not used, subtract was never implemented
Less dead code to maintain.  Also more Doxygen comments.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:59 -05:00
Zygo Blaxell
152e69a6d1 bytevector: validate length in get<T>()
Don't allow a pointer to T to be taken from a ByteVector that is not at
least sizeof(T) bytes long.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
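The length check from commit 152e69a6d1 above, sketched on a plain `std::vector<uint8_t>` instead of crucible's ByteVector. `checked_get` is a hypothetical name, and unlike the method in the commit it returns a copy rather than a pointer, which sidesteps alignment questions in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical sketch: decode a T from a byte buffer, refusing to
// read past the end of the buffer.
template <class T>
T checked_get(const std::vector<uint8_t> &bytes, std::size_t offset = 0)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "T must be trivially copyable");
    // Written so the comparison cannot overflow.
    if (offset > bytes.size() || bytes.size() - offset < sizeof(T)) {
        throw std::out_of_range("buffer too short for requested type");
    }
    T value;
    std::memcpy(&value, bytes.data() + offset, sizeof(T));
    return value;
}
```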
Zygo Blaxell
148cc03060 bytevector: do not deadlock in self-assignment
Not that this is a particularly useful use case, but it will lock up,
and it should not.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
Zygo Blaxell
b699325a77 bytevector: don't need _all_ of those mutexes
Methods that don't even look at the pointer don't need a mutex.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
Zygo Blaxell
a59d89ea81 bytevector: add some fugly mutexes
We are using ByteVectors from multiple threads in some cases.  Mostly
these are the status and progress threads which read the ByteVector
object references embedded in BEESNOTE macros.

Since it's not clear what the data race implications are, protect
the shared_ptr in ByteVector with a mutex for now.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
Zygo Blaxell
d1015b683f bytevector: add ostream output with hexdump
There is a hexdump template in fs.  Move hexdump to its own header,
then ByteVector can use it too.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
Zygo Blaxell
9cdeb608f5 bees: drop the balance/logical workaround that has been disabled for two years
Kernels that needed the balance workaround frankly are too buggy
to run bees at all.  The workaround also makes the locking stories
around logical_ino calls and process exit complicated, so get rid of
it completely.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
Zygo Blaxell
83a2b010e6 context: drop long-dead ExtentWalker code
At some point BtrfsExtentWalker will be fully deprecated and removed from
bees.  Might as well start with code that hasn't been built in 6 years.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
Zygo Blaxell
31b2aa3c0d context: speed up orderly process termination
Quite often bees exceeds its service timeout for termination because
it is waiting for a loop embedded in a Task to finish some long-running
btrfs operation.  This can cause bees to be aborted by SIGKILL before
it can completely flush the hash table or save crawl state.

There are only two important things SIGTERM does when bees terminates:
 1.  Save crawl progress
 2.  Flush out the hash table

Everything else is automatically handled by the kernel when the process
is terminated by SIGKILL, so we don't have to bother doing it ourselves.
This can save considerable time at shutdown since we don't have to wait
for every thread to reach a point where it becomes idle, or force loops
to terminate by throwing exceptions, or check a condition every time we
access a pointer.  Instead, we need do only the things in the list
above, and then call _exit() to clean up everything else.

Hash table and crawl state writeback can happen in their background
threads instead of the foreground one.  Separate the "stop" method for
these classes into "stop_request" and "stop_wait" so that these writebacks
can run at the same time.

Deprecate and remove all references to the BeesHalt exception, and remove
several unnecessary checks for BeesContext::stop_requested.

Pause the task queue instead of cancelling it, which preserves the
crawl progress state and stops new Tasks from competing for iops and
CPU during writeback.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:58 -05:00
Zygo Blaxell
594ad1786d context: dump current load tracking stats
Dump the instantaneous load (last 5 seconds, extracted from load average)
and the computed target worker count (before rounding and truncation)
on the same status line as the task and worker thread count.

This should give better visibility into Task's thread count calculation
algorithm.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
b143664747 task: use exponential backoff algorithm to set thread count
Tasks are often running longer than 5 seconds (especially extents with
multiple references requiring copy operations), so the load tracking
algorithm needs to average several samples over a longer period of time
than 5 seconds.  If the sample period is 60 seconds, we end up recomputing
the original load average from current_load, so skip the rounding error
and use the original load average value.

Arguably the real fix is to break up the more complex extent operations
over several downstream Task objects, but that's a more significant
design change.

Tweak the attack and decay rates so that threads are started a little
more slowly, but still stopped rapidly when load spikes up.

Remove the hysteresis to provide support for load average targets
below 1, or with fractional components, with a PWM-like effect.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
a85ada3a49 task: export load tracking statistics
Provide an interface so that programs can monitor the Task load
average calculations.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
46a38fe016 task: rescue post-exec queue on Task destruction
task1.append(task2) is supposed to run task2 after task1 is executed;
however, if task1 was just executed, and its last reference was owned by
a TaskConsumer, then task2 will be appended to a Task that will never
run again.

A similar problem arises in Exclusion, which can cause blocked tasks
to occasionally be dropped without executing them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
2aafa802a9 task: increase saved thread name length to 64
24 bytes seems a little low.  64 is a rounder (and more square) number.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
cdef59e2f3 task: add more Doxygen comments for PairLock
I need to remind myself why it's there, and not just std::lock.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
dc2dc8d08a task: delete the queue after deleting all of its children
This was resulting in an assertion failure later on if a queue was
being rescued from a deleted task with only one post-exec queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
7873988dac task: add a pause() method as an alternative to cancel()
pause(true) stops the TaskMaster from processing any more Tasks,
but does not destroy any queued Tasks.

pause(false) re-enables Task processing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:57 -05:00
Zygo Blaxell
3f740d6b2d task: simplify clear_queue
Simplify the loop in clear_queue because we can't be modifying a
queue while we are clearing it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:56 -05:00
Zygo Blaxell
c0a7533dd4 task: use const for current_consumer
The const version of this code has much more testing, but any
effect at run time is unlikely.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:56 -05:00
Zygo Blaxell
090fa39995 task: don't hold the mutex while disposing of pending Tasks
In the event that someday Barrier allows users to force execution of
its pending tasks prior to the destruction of the BarrierState object,
we'll be ready to submit those Tasks for execution without waiting for
the BarrierState mutex lock.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:56 -05:00
Zygo Blaxell
2f25f89067 task: get rid of separate Exclusion and ExclusionState
Exclusion was generating a new Task every time a lock was contended.
That results in thousands of empty Task objects which contain a single
Task item.

Get rid of ExclusionState.  Exclusion is now a simple weak_ptr to a Task.
If the weak_ptr is expired, the Exclusion is unlocked.  If the weak_ptr
is not expired, it points to the Task which owns the Exclusion.

try_lock now appends the Task attempting to lock the Exclusion directly
to the owning Task, eliminating the need for Exclusion to have one.
This also removes the need to call insert_task separately, though
insert_task remains for other use cases.

With no ExclusionState there is no need for a string argument to
Exclusion's constructor, so get rid of that too.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:56 -05:00
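The weak_ptr design described in commit 2f25f89067 above can be sketched as follows. This is a simplified illustration, not the crucible Task library: the `Task` struct here is a bare stand-in, `try_lock` returns a plain bool, and the owner's post-exec queue is not separately synchronized, which a real implementation would need.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for a Task: only the post-exec queue is modeled.
struct Task {
    std::vector<std::shared_ptr<Task>> post_exec;  // run after this Task
};

// Hypothetical sketch: the Exclusion is just a weak reference to its owner.
class Exclusion {
    std::mutex m_mutex;
    std::weak_ptr<Task> m_owner;  // expired means unlocked
public:
    // Returns true if `task` owns the Exclusion after the call.
    // Otherwise `task` has been queued behind the current owner.
    bool try_lock(const std::shared_ptr<Task> &task) {
        assert(task);
        std::lock_guard<std::mutex> guard(m_mutex);
        const auto owner = m_owner.lock();
        if (!owner) {
            m_owner = task;
            return true;
        }
        if (owner == task) return true;
        owner->post_exec.push_back(task);
        return false;
    }
};
```

No extra Task object is created when the lock is contended: the waiting Task goes directly onto the owner's queue, and the Exclusion unlocks by itself once the owning Task is destroyed.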
Zygo Blaxell
7fdb87143c task: get rid of the separate Barrier and BarrierLock
Make one class Barrier which is copiable, so we don't have to
have users making shared Barrier all the time.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:55 -05:00
Zygo Blaxell
d345ea2b78 readahead: use emulation
It seems that readahead() does not work on btrfs, or at least it has
no discernable effect.  Enable the workaround instead.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:55 -05:00
Zygo Blaxell
a2e1887c52 bees: use MultiLocker to serialize dedupe and logical_ino
In current kernels there is a bug which leads to an infinite loop in
add_all_parents().  The bug is triggered by one thread running dedupe
while another runs logical_ino.

Work around this by ensuring that the bees process never runs dedupe and
logical_ino ioctls at the same time.  Any number of either can run
at the same time, but not one of both.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:55 -05:00
Zygo Blaxell
4a4a2de89f multilocker: serialize conflicting parallel operations
For performance or workaround reasons we sometimes have to avoid doing
two conflicting operations at the same time, but we can still run any
number of non-conflicting operations in parallel.

MultiLocker (suggestions for a better class name welcome) blocks the
calling thread until there are no threads attempting to run a conflicting
operation.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:54 -05:00
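A minimal sketch of the locking rule in commit 4a4a2de89f above: any number of callers running the same kind of operation may proceed together, while a caller wanting a different kind waits until the others finish. `TypeLocker` and its methods are hypothetical names, not the MultiLocker interface, and this version makes no attempt at fairness, so one operation type could starve another.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>

class TypeLocker {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::string m_active;       // operation type currently running
    std::size_t m_holders = 0;  // callers currently holding the lock

    bool compatible(const std::string &type) const {
        return m_holders == 0 || m_active == type;
    }
public:
    // Block until no conflicting operation type is running.
    void lock(const std::string &type) {
        std::unique_lock<std::mutex> guard(m_mutex);
        m_cv.wait(guard, [&] { return compatible(type); });
        m_active = type;
        ++m_holders;
    }
    // Non-blocking variant: false if a conflicting type is running.
    bool try_lock(const std::string &type) {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (!compatible(type)) return false;
        m_active = type;
        ++m_holders;
        return true;
    }
    void unlock() {
        std::lock_guard<std::mutex> guard(m_mutex);
        assert(m_holders > 0);
        if (--m_holders == 0) m_cv.notify_all();
    }
};
```

In the workaround described in the commit that follows this one in the log, the two types would be dedupe and logical_ino: several dedupes can run together, several logical_ino calls can run together, but never one of each.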
Zygo Blaxell
cc87125e41 bees: drop bees_sync, we will not need it
bees_sync() was an exception-trapping wrapper around fsync() which is
not needed in any of the contexts from which it was called:

	1.  dedupe operations implicitly flush the src data, so there is
	no need to call fsync() to do that twice.

	2.  crawl position is written to a temporary file and renamed
	over the original, which always forces a flush when the original
	exists.  On the first write, where there is no original, a
	crash would result in starting over with an empty or hole-filled
	beescrawl file, which is the initial state of bees.  There is also
	a long history of kernel bugs triggered by fsync() in this case.

	3.  we use unreadahead to trigger writeback for flushing the
	hash table to persistent storage.  Here is a space where we might
	use fsync after all, as part of bees_unreadahead's emulation of
	POSIX_FADV_DONTNEED, but we need to get read-once behavior from
	the scanner before we can use this capability.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:54 -05:00
Zygo Blaxell
be9321cdb3 roots: correctly track crawl dirty state
If there's an error while writing the crawl state, the state should
remain dirty.  If the crawl state is successfully written, the state
is only clean if there were no changes to crawl state since the write
was committed.  We need to release the lock while writing the state but
correctly set the dirty flag when the state is written successfully.

Replace the bool with a version number counter.  Track the last version
successfully saved and the current version of the crawl state.  The state
is dirty if these counters disagree and clean if they agree.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:54 -05:00
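The version counter scheme from commit be9321cdb3 above, as a small sketch. `VersionedState`, its string payload, and the `write` callback are assumptions for illustration; the real crawl state in bees is richer than a single string.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>

class VersionedState {
    std::mutex m_mutex;
    std::string m_data;
    uint64_t m_version = 0;        // bumped on every change
    uint64_t m_saved_version = 0;  // version of the last successful save
public:
    void update(const std::string &data) {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_data = data;
        ++m_version;
    }
    bool dirty() {
        std::lock_guard<std::mutex> guard(m_mutex);
        return m_version != m_saved_version;
    }
    // `write` persists a snapshot and returns true on success.
    bool save(const std::function<bool(const std::string &)> &write) {
        std::string snapshot;
        uint64_t snapshot_version;
        {
            std::lock_guard<std::mutex> guard(m_mutex);
            snapshot = m_data;
            snapshot_version = m_version;
        }
        // The lock is not held during the (slow) write.
        if (!write(snapshot)) return false;  // failed write: still dirty
        std::lock_guard<std::mutex> guard(m_mutex);
        m_saved_version = snapshot_version;  // record only what was written
        return true;
    }
};
```

If the state changes while the write is in progress, the counters still disagree afterwards, so the state is correctly reported as dirty even though the write succeeded.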
Zygo Blaxell
a9c81e5531 bees: drop m_parent_ctx
It has not been used since 2016.

Also drop the explicit default constructor.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:54 -05:00
Zygo Blaxell
942800ad00 fd: add some doxygen
Still very incomplete, but better than it was before.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:54 -05:00
Zygo Blaxell
21c08008e6 namedptr: add some doxygen, fix the #endif comment
Document the overall purpose of the class and what some of the methods do,
particularly the ones with terrible names like 'insert_item' (which only
inserts an item after calling the Function).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:54 -05:00
Zygo Blaxell
30ece57116 fs: export btrfs_compress_type_ntoa
We already had a function that was _similar_, so add decoding for compress
type NONE, give it a less specific name, and declare it in fs.h.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:54 -05:00
Zygo Blaxell
6556566f54 ntoa: fix type of mask
It really needs to be uint64_t, but at least it now doesn't contradict
the definition in the earlier header.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:53 -05:00
Zygo Blaxell
ece58cc910 cache: add a method to get estimated cache size
Estimated because there is no lock preventing the result from
changing before it is used.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-12-20 20:50:53 -05:00
Zygo Blaxell
331cb142e3 fs: make dedupe work again after a really unfortunate build fix
In commit 14ce81c08 "fs: get rid of silly base class that causes build
failures now" I neglected to set the dest_count field in the ioctl
arg structure, so bees master hasn't been deduping anything for about
three weeks.

I'd put a THROW_CHECK in here to catch this kind of bug in the future,
but it would be placed at exactly the point where this fix is.
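
A minimal sketch of the kind of structure involved, assuming hypothetical
stand-in types (SameArgsHeader, SameArgsInfo, build_same_args) rather than
the real ioctl definitions or the bees code.  The header carries a count of
the records that follow it in the same buffer, and nothing forces that
count to be filled in:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for an ioctl header and its per-destination records.
struct SameArgsHeader {
	uint64_t logical_offset;
	uint64_t length;
	uint16_t dest_count;	// number of SameArgsInfo records after the header
	uint16_t reserved1;
	uint32_t reserved2;
};

struct SameArgsInfo {
	int64_t fd;
	uint64_t logical_offset;
	uint64_t bytes_deduped;
	int32_t status;
	uint32_t reserved;
};

// Build header + records contiguously in one buffer.
inline std::vector<uint8_t> build_same_args(uint64_t offset, uint64_t length,
	const std::vector<SameArgsInfo> &dests)
{
	SameArgsHeader header = {};
	header.logical_offset = offset;
	header.length = length;
	// Forgetting this line leaves the count at zero, so no records are processed.
	header.dest_count = static_cast<uint16_t>(dests.size());

	std::vector<uint8_t> buf(sizeof(header) + dests.size() * sizeof(SameArgsInfo));
	memcpy(buf.data(), &header, sizeof(header));
	if (!dests.empty()) {
		memcpy(buf.data() + sizeof(header), dests.data(),
			dests.size() * sizeof(SameArgsInfo));
	}
	return buf;
}
```

With a zero count the call is still well-formed, which is why a bug of this
kind produces no error, only an absence of work.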

Fixes: 14ce81c08
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-11-05 13:43:21 -04:00
Zygo Blaxell
5953ea6d3c fs: update btrfs compatibility header: add csum types, BTRFS_FS_INFO_FLAG_GENERATION and _METADATA_UUID
I guess this means it's "args_v3" now?

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-25 12:56:16 -04:00
Zygo Blaxell
07a4c9e8c0 roots: sprinkle on some more const
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-25 12:56:16 -04:00
Zygo Blaxell
8f6f8e4ac2 roots: make sure we can never get a uint_max transid
If we iterate over all roots to find the max transid, but the set of
all roots is empty, we'll get a nonsense number.  Make sure that number
doesn't reach the crawling logic by killing it with an exception.
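
As an illustration only (checked_max_transid is a hypothetical helper, not
the function in bees), the guard could look roughly like this:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the largest transid in the list, and throws
// instead of handing a nonsense value to the caller.
inline uint64_t checked_max_transid(const std::vector<uint64_t> &transids)
{
	if (transids.empty()) {
		throw std::runtime_error("no roots: max transid is undefined");
	}
	const uint64_t rv = *std::max_element(transids.begin(), transids.end());
	if (rv == std::numeric_limits<uint64_t>::max()) {
		throw std::runtime_error("max transid is uint_max");
	}
	return rv;
}
```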

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-25 12:56:16 -04:00
Zygo Blaxell
972721016b fs: get rid of base class fiemap
Yet another build failure of the form:

	error: flexible array member fiemap... not at end of struct crucible::Fiemap...

bees doesn't use fiemap any more, so the fixes here are minimal changes
to make it build, not shining examples of C++ class design.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-25 12:56:16 -04:00
Zygo Blaxell
5040303f50 fs: get rid of base class btrfs_data_container
This fixes another build failure of the form:

	error: flexible array member btrfs_... not at end of struct crucible::Btrfs...

Fixes: https://github.com/Zygo/bees/issues/236
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-23 22:42:57 -04:00
Zygo Blaxell
3654738f56 bees: fix deprecated-copy warnings for clang-14
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-23 22:39:59 -04:00
Zygo Blaxell
be3c54e14c extentwalker: drop explicit default constructors
They're all public because it's a struct, so there's no need to make
them explicit.  clang-14 deprecates these.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-23 22:39:59 -04:00
KhalilSantana
2751905f1d Fixes a bad grep pattern caused by dffd6e0
Fixes #233
2022-10-13 16:03:30 -04:00
Zygo Blaxell
587588d53f bytevector: fix length check
ByteVectors, and shared subranges thereof, might be empty.  The parameter
check should allow that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-10 17:40:33 -04:00
Zygo Blaxell
14ce81c081 fs: get rid of silly base class that causes build failures now
The base class thing was an ugly way to get around the lack of C99
compound literals in C++, and also to make the bare ioctls usable with
the derived classes.

Today, both clang and gcc have C99 compound literals, so there's no need
to do crazy things with memset.  We never used the derived classes for
ioctls, and for this specific ioctl it would have been a very, very bad
idea, so there's no need to support that either.  We do need to jump
through hoops for ostream& operator<<() but we had to do those anyway
as there are other members in the derived type.

So we can simply drop the base class, and build the args object on the
stack in `do_ioctl`.  This also removes the need to verify initialization.

There's no bug here since the `info` member of the base class was
never used in place by the derived class, but new compilers reject the
flexible array member in the base class because the derived class makes
`info` be not at the end of the struct any more:

	error: flexible array member btrfs_ioctl_same_args::info not at end of struct crucible::BtrfsExtentSame

Fixes: https://github.com/Zygo/bees/issues/232
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-10-09 20:39:15 -04:00
Khalil Santana
dffd6e0b13 Get rid of errors by using grep -E
"egrep: warning: egrep is obsolescent; using grep -E"
2022-10-05 23:00:37 -04:00
Zygo Blaxell
a32cd5247f docs: update kernel bugs list for 5.18 ptvf fix
Also correct my own style for the fixed version column.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-08-17 13:04:06 -04:00
Zygo Blaxell
9c68f15474 README: update copyright year 2022
It has been some years since the copyright statement was updated.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-07-29 22:20:02 -04:00
Zygo Blaxell
5f3cb9b374 docs: update kernel bugs list for 2022-07-29
 * RAID1 device count problems fixed
 * log tree replay parent transid verify failure in 5.18 and 5.19 added, patches available but not upstream yet
 * flushoncommit issues fixed, discussion section removed
 * LOGICAL_INO vs dedupe hang added

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2022-07-29 22:07:26 -04:00
Ayla Ounce
a52062822a Fix beesd script arg parsing to respect PREFIX
Without this, if you install to a different PREFIX such as /usr/local
it will fail to recognize any arguments and if you use the systemd unit,
that makes --no-timestamps the first NOT_SUPPORTED_ARG which will get
passed to uuidparse, which doesn't recognize it and errors.
2022-04-10 14:12:24 -07:00
Zygo Blaxell
fbf6b395c8 types: member m_fd in BeesFileRange must be protected against data races
We had an unfortunate pattern of:

	const BeesFileRange bfr;
	shared_ptr<BeesContext> ctx;
	// ...
	BEESNOTE("foo " << bfr);
	bfr.fd(ctx);
	BEESNOTE("foo after opening: " << bfr);

If dump_status started running after the first BEESNOTE, but before
the second, then bfr.fd() might expose a single Fd object's shared_ptr
member to two threads at the same time (the thread running dump_status
and the thread running BEESNOTE) without protection by a lock.  One of
the threads would see a partially-initialized Fd object, and the other
thread would crash on an assertion failure, e.g.

	#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
	#1  0x00007f4c4fde5537 in __GI_abort () at abort.c:79
	#2  0x00007f4c4fde540f in __assert_fail_base (fmt=0x7f4c4ff4e128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5557605629dd "!m_destroyed", file=0x5557605627c0 "../include/crucible/namedptr.h", line=77, function=<optimized out>) at assert.c:92
	#3  0x00007f4c4fdf4662 in __GI___assert_fail (assertion=assertion@entry=0x5557605629dd "!m_destroyed", file=file@entry=0x5557605627c0 "../include/crucible/namedptr.h", line=line@entry=77,
	    function=function@entry=0x555760562970 "crucible::NamedPtr<Return, Arguments>::Value::~Value() [with Return = crucible::IOHandle; Arguments = {int}]") at assert.c:101
	#4  0x00005557605306f6 in crucible::NamedPtr<crucible::IOHandle, int>::Value::~Value (this=0x7f4a3c2ff0d0, __in_chrg=<optimized out>) at ../include/crucible/namedptr.h:77
	#5  0x00005557605137da in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151
	#6  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f4a3c2ff0c0) at /usr/include/c++/10/bits/shared_ptr_base.h:151
	#7  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f4c4c5b5f28, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
	#8  std::__shared_ptr<crucible::IOHandle, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
	#9  std::shared_ptr<crucible::IOHandle>::~shared_ptr (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
	#10 crucible::Fd::~Fd (this=0x7f4c4c5b5f20, __in_chrg=<optimized out>) at ../include/crucible/fd.h:46
	#11 BeesFileRange::file_size (this=0x7f4c4e5ba4a0) at bees-types.cc:156
	#12 0x0000555760513950 in operator<< (os=..., bfr=...) at bees-types.cc:80
	#13 0x000055576050d662 in std::function<void (std::ostream&)>::operator()(std::ostream&) const (__args#0=..., this=0x7f4c4e5b9f60) at /usr/include/c++/10/bits/std_function.h:622
	#14 BeesNote::get_status[abi:cxx11]() () at bees-trace.cc:165
	#15 0x00005557604c9676 in BeesContext::dump_status (this=0x5557611c4de0) at bees-context.cc:89
	#16 0x00005557605206fb in std::function<void ()>::operator()() const (this=this@entry=0x7f4c4c5b65f0) at /usr/include/c++/10/bits/std_function.h:622
	#17 crucible::catch_all(std::function<void ()> const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)> const&) (f=..., explainer=...) at error.cc:55
	#18 0x000055576050aaa7 in operator() (__closure=0x5557611c52c8) at bees-thread.cc:22
	#19 0x00007f4c501beed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
	#20 0x00007f4c502c8ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
	#21 0x00007f4c4febddef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Fix by making BeesFileRange::m_fd really const (not just mutable),
then fix all the broken code referencing it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
26acc6adfd bytevector: introduce BEES_VALGRIND to help work around valgrind
valgrind doesn't understand ioctl arguments, so it does not know if
or when they initialize memory, and it complains about conditionals
depending on data that comes out of ioctls.  That's a problem for bees,
where every decision we ever make is based on data an ioctl gave us.

Fix the initialization issue by using calloc instead of malloc for
ByteVectors when we are building for valgrind.  Don't enable this by
default because all the callocs aren't necessary (assuming the rest
of the code is correct) and hurt performance.

Define BEES_VALGRIND in localconf to activate, e.g.

	echo CCFLAGS += -DBEES_VALGRIND=1 >> localconf
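
A rough sketch of the switch, assuming a hypothetical alloc_bytes helper
(only the BEES_VALGRIND macro name comes from this commit):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator switch.  With BEES_VALGRIND defined, memory is
// zero-filled so valgrind sees every byte as initialized; otherwise the
// cheaper uninitialized allocation is used.
inline void *alloc_bytes(size_t size)
{
#ifdef BEES_VALGRIND
	return calloc(1, size);
#else
	return malloc(size);
#endif
}
```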

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
01734e6d4b hash: initialize m_dirty in BeesHashTable
It turns out we never set m_dirty's initial value.  This is not a
practical problem because 1) it's mostly harmless if m_dirty is spuriously
true, 2) we set it to true every time bees scans a data block, and 3)
the allocation happens early in startup when most memory allocations
are using zero-filled pages, so it's probably getting a false value at
construction in most cases.

valgrind complains about it, so it has to go.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
84094c7cb9 context: use consistent status for dedupe in log and thread note
Once the physical addresses are known, put them where they can be
seen in BEESTATUS as well as the log.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
a3d2bc26d5 progress: lock down some const methods
begin() and end() don't mutate their object

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
d0c35b4734 fs: yet another const
References to the search key do not need to be modified.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
a83c68eb18 bees: style cleanups: const, size_t, symbolic names
No functional changes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
6d6686eb5b context: get rid of resolve (LOGICAL_INO) serializer
There are kernel bugs in LOGICAL_INO from time to time; however, we
can't avoid these bugs by serializing LOGICAL_INO calls.

It hasn't been used for some time, so remove the code and
less-than-completely-accurate comments.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
Zygo Blaxell
007067b83f docs: add missing 'adjust_offset_hit' counter
Reported by York-Simon Johannsen via github issue 208.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:10:02 -05:00
suorcd
bb5160987e docs: spell "snapshot" correctly
https://github.com/Zygo/bees/pull/209

Edited: regenerate docs for the downstream change in index.md.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-12-19 15:08:26 -05:00
Zygo Blaxell
feed04c944 gitignore: clang creates a lot of *.tmp files
Also sort the list of extensions.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
670fce5be5 resolve: reword the too-many-duplicates exception message
For one thing, it should _say_ that there are too many duplicates.
We were making the user read the manual to find that out.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
ff3b5a7a1b hash: drop bees_unreadahead
Forcing the entire hash table into immediate writeback causes crippling
write latencies at shutdown.  Even discarding pages as they are read in
at startup can trigger a writeback latency spike if the pages are dirty
at read time.

Better to let the VM subsystem handle this on its own.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
13ec4b5165 hash: add utsname fields to log output
Putting this information in the logs saves us from having to ask for
the kernel version and machine name every time.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
5d7e815eb4 lib: add Uname, a constructor for utsname
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
7f67f55746 docs: remove some stray whitespace
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
0103a04ca0 task: concurrency cleanups
Update thread_local task state pointers while locked.  This avoids
potential concurrent access of the pointers while making copies of them.

Verify that the queue is really empty after splicing lists, and the
current consumer is really gone after swapping the empty one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
5e346beb2d task: delete the move constructor for TaskState
Move-constructing isn't good for that class either.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
85c93c10e6 bees: clean up #include list
No need for atomic, and sort the Linux headers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
ba694b4881 hash: move the random generator out of bees-hash.cc
We need random numbers in more places, so centralize the engines.
Initialize with a proper random seed so every worker thread gets
different behavior.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
73f94750ec namedptr: concurrency and const cleanup
Fix the locking order for the case where an exception is thrown
in shared_ptr's allocator.

More const.

Drop the explicit closure return type since the compiler can deduce it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
5e379b4c48 readahead: update comments to reflect bakeoff results
It turns out that readahead() alone is fastest.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
6325f9ed72 lib: deprecate memset_zero template, use C99 compound literals instead
Sprinkle in some asserts to make sure compilers aren't getting creative.

This may introduce a new compiler dependency, as I suspect older versions
of GCC don't support this syntax.

It definitely needs a new compiler flag to suppress a warning when some
fields are not explicitly initialized.  If we've omitted a field, it's
because it's a field we don't know (or care) about, and we want that
thing initialized to zero.
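
The zero-fill behaviour can be sketched with standard C++ aggregate
initialization, used here as a stand-in for the compound literal syntax.
ExampleArgs and make_example_args are hypothetical names.  A warning such
as GCC's -Wmissing-field-initializers is the kind that fires on this
pattern; that is an assumption, since this message does not name the flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical argument structure; the field names are illustrative only.
struct ExampleArgs {
	uint64_t offset;
	uint64_t length;
	uint32_t flags;
	uint32_t reserved[4];
};

inline ExampleArgs make_example_args(uint64_t offset)
{
	// Only the first member is given a value.  The remaining members are
	// value-initialized, which for these integer types means zero.
	ExampleArgs args = { offset };
	return args;
}
```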

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
c698fd7211 context: stop using deprecated memset_zero template
Use ordinary literal initialization instead.  The ioctl doesn't
need initialization of args at all.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
95347a08bb fd: better error messages for pread/pwrite
Include file name and offset.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
eb2630dee6 docs: document resolve_overflow
In commit d9e3c0070b "context: stop creating
new refs when there are too many already" we added a new counter, but didn't
document it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
b828f14dd1 task: optimize for common case of single following Task
If there is only one Task in the post exec queue, we can
simply insert that Task instead of creating a task to hold
a post exec queue of one item.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
ecf110f377 context: add a comment explaining why we are not adding bees_unreadahead
At the end of scanning one extent, in theory we do not need that extent
any more.  In practice, it hurts benchmark scores if we drop the extents
after reading them.

Add a comment to note this where we put the bees_unreadahead call.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
7f7f919d08 context: fix the status message that will never be seen
BEESNOTE can only be seen if the status thread is running at the time,
making the log of activities during shutdown incomplete.

Wake up the status thread early during shutdown so the logged sequence
of shutdown actions is complete.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
11fabd66a8 context: add experimental code for avoiding tiny extents
In the current architecture we can't directly measure the physical extent
size, and we can't make good decisions with the extent data (reference)
item alone.  If the early return is enabled here, there is a small speedup
and a large drop in dedupe hit rate, especially when extent splits occur.

Leave the early return commented for now, but collect the event statistics.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Zygo Blaxell
a60c53a9e1 fs: dump the TREE_SEARCH_V2 parameters on exception
The current error message is useless.  At least say which tree we were
searching.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-11-29 21:27:48 -05:00
Javi Vilarroig
01cb75ac0e Minimal changes in beesd script to make it functional in my system
2021-11-29 20:53:04 +01:00
Zygo Blaxell
7a8d98f94d roots: use the new type argument to next_min
Tree searches are all looking for specific item types.  Skip over any
item types we are not interested in when resetting the search key for
the next search.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:59:09 -04:00
Zygo Blaxell
fcd847bbf9 fs: add an item type parameter to next_min
When we are searching the btrfs metadata trees, we usually want only
one type of item.  If the last item in a search result is not of the
desired type, we can restart the search at the next possible key with
that item type, potentially skipping over some uninteresting items we
would otherwise have to fetch, process, and discard.

Also remove a bug in the previous next_min code that would skip over
items if the offset overflowed and the next objectid in the tree had a
lower item type number than the previous objectid.  This doesn't seem
to be a bug that has ever happened, as it would require a file to roll
over in the offset field.
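
A sketch of the idea, reconstructed from the description above and not
copied from fs.cc.  ExampleKey and next_key_of_type are hypothetical names.
btrfs keys sort by (objectid, type, offset), so the next key of a wanted
type can be computed from the last key seen:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical key type; btrfs keys sort by (objectid, type, offset).
struct ExampleKey {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

// Smallest key of the wanted type that sorts strictly after `last`.
// Assumes objectid itself does not overflow.
inline ExampleKey next_key_of_type(const ExampleKey &last, uint8_t wanted)
{
	if (last.type < wanted) {
		// The wanted type is still ahead of us within the same objectid.
		return ExampleKey{ last.objectid, wanted, 0 };
	}
	if (last.type == wanted && last.offset != std::numeric_limits<uint64_t>::max()) {
		// Same objectid and type: advance the offset by one.
		return ExampleKey{ last.objectid, wanted, last.offset + 1 };
	}
	// Either we are past the wanted type, or the offset would overflow:
	// the next candidate is in the following objectid.
	return ExampleKey{ last.objectid + 1, wanted, 0 };
}
```

In the third case the type is always reset to the wanted type, which is the
situation the overflow bug described above got wrong.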

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:56:04 -04:00
Zygo Blaxell
e861957632 roots: use default nr_items
BtrfsIoctlSearchKeyV2's constructor now fills in nr_items = 1, so we
don't need to set it explicitly any more.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:56:04 -04:00
Zygo Blaxell
fb0e676ee8 string: drop vector_copy_struct, obsoleted by ByteVector
vector_copy_struct constructed a std::vector<uint8_t> from a fixed-size
struct.  ByteVector replaces std::vector<uint8_t> and has a template
constructor which does the same thing as vector_copy_struct, so there
is no longer a need for this function.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:56:04 -04:00
Zygo Blaxell
b2db140666 spanner: drop Spanner, replaced by ByteVector
Spanner was a workaround for terrible std::vector _copy_ performance,
but it turns out that std::vector has terrible _allocator_ performance
(compared to an implementation based on malloc and memcpy).  Spanner is a
workaround for the copy performance issue, so it doesn't help very much.
Refraining from using vector at all is much better.

Now that all code that used Spanner has been converted to ByteVector,
there's no further need for Spanner<uint8_t>, which was the only type
it was ever used for.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:50:25 -04:00
Zygo Blaxell
55dc98e21a fd: finish deprecating vector<uint8_t> in IO wrapper functions
We can simply remove the template specializations, but if we do that, then
existing code might accidentally write out the vector<uint8_t> struct.

Prevent regressions by deleting the vector specializations, making any
code that uses them fail to build.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
14cd6ed033 bees: deprecate vector<uint8_t> and replace with ByteVector
The vector<uint8_t> in the hash table doesn't hurt very much--only a few
microseconds per 128K hash block.

The vector<uint8_t> in BeesBlockData hurts a bit more--we run that
constructor thousands of times per second.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
99709d889f fd: start deprecating vector<uint8_t> for p{read,write}_or_die
Add support for pread and pwrite of ByteVector objects alongside
vector<uint8_t>.  A later commit will delete the template specializations
for vector<uint8_t>, but existing users have to be updated to use
ByteVector first.

Nothing currently uses vector<char>, so we can delete that immediately.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
bba6f4f183 fs: convert vector<uint8_t> and Spanner to ByteVector and rewrite TREE_SEARCH_V2 wrapper
Switch various methods in fs to use ByteVector to cut down on the number
of slow allocations and copies.

Automatically determine the correct size for TREE_SEARCH_V2 buffers
based on the number of items requested, and grow the buffer as needed.
This eliminates the need to cache some objects that were heavy to create.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
daf8a2cde1 extentwalker: use default sizing of TREE_SEARCH_V2 buffers
Now that we can guess the size more or less automatically, there's
no need to make it unnecessarily large.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
ba1f3b93e4 fs: drop virtual do_ioctl methods for btrfs_ioctl_search_key
These were never used, and they make the object very slightly heavier.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
f0eb9b202f lib: introduce ByteVector as a replacement for vector<uint8_t> and Spanner
After some benchmarking, it turns out that std::vector<uint8_t> is
about 160 times slower than malloc().  malloc() is faster than "new
uint8_t[]" too.  Get rid of std::vector<uint8_t> and replace it with
a lightweight wrapper around malloc(), free(), and memcpy().

ByteVector has helpful methods for the common case of moving data to and
from ioctl calls that use a fixed-length header placed contiguously with a
variable-length input/output buffer.  Data bytes are shared between copied
ByteVector objects, allowing a large single buffer to be cheaply chopped
up into smaller objects without memory copies.  ByteVector implements the
more useful parts of the std::vector API, so it can replace std::vector
objects without needing an awkward adaptor class like Spanner.
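
The sharing behaviour can be sketched in miniature.  SharedBytes below is a
hypothetical class written for this illustration; it is not the ByteVector
interface, only the same general idea of copies and sub-ranges pointing at
one malloc'ed buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>
#include <stdexcept>

// Hypothetical miniature of a shared byte buffer.
class SharedBytes {
	std::shared_ptr<uint8_t> m_data;
	size_t m_offset = 0;
	size_t m_size = 0;
public:
	explicit SharedBytes(size_t size) :
		// malloc(0) may return null, so always request at least one byte.
		m_data(static_cast<uint8_t *>(malloc(size ? size : 1)),
			[](uint8_t *p) { free(p); }),
		m_size(size)
	{
		if (!m_data) throw std::bad_alloc();
	}

	// A sub-range shares the same bytes; nothing is copied.
	// Empty sub-ranges are allowed.
	SharedBytes slice(size_t offset, size_t length) const
	{
		if (offset > m_size || length > m_size - offset) {
			throw std::out_of_range("slice outside buffer");
		}
		SharedBytes rv(*this);
		rv.m_offset += offset;
		rv.m_size = length;
		return rv;
	}

	uint8_t *data() const { return m_data.get() + m_offset; }
	size_t size() const { return m_size; }
};
```

A write through a slice is visible through the parent object, which is the
property that makes chopping up one large ioctl buffer cheap.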

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
2e36dd2d58 error: introduce THROW_CHECK4, the long-awaited sequel to THROW_CHECK3
Sometimes we need to check constraints on 4 variables at once.

It would be nice if variadic macros in C++ were also polymorphic.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
2f14a5a9c7 roots: reduce number of objects per TREE_SEARCH_V2, drop BEES_MAX_CRAWL_ITEMS and BEES_MAX_CRAWL_BYTES
This makes better use of dynamic buffer sizing, and reduces the amount
of stale data lying around.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
cf4091b352 endian: fix uint16_t specialization of le_to_cpu
Fortunately, we have not had cause to read any 16-bit fields out of
btrfs structures yet.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
587870911f roots: use const more
Mark local variables that can be const const.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
Zygo Blaxell
d384f3eec0 roots: ignore subvol when it is read-only and send workaround is enabled
Previously, when the bees send workaround is enabled, bees would
immediately advance the subvol's crawl status as if the entire subvol
had been scanned.

If the subvol is later made read-write, or if the workaround is disabled,
bees sees that the subvol has already been marked as scanned.  This is
an unfortunate result if the subvol is inadvertently marked read-only
or if bees is inadvertently run with the send workaround disabled.

Instead, (almost) completely ignore the subvol:  don't advance the crawl
pointer, don't consider the subvol in the list of searchable roots, and
don't consider the subvol when calculating min_transid for new subvols.

The "almost" part is:  if the subvol scan has not yet started, keep its
start timestamp current so it won't mess up subvol traversal performance
metrics.

Also handle exceptions while determining whether a subvol is read-only,
as those apparently do happen.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-31 19:42:01 -04:00
gin66
596f2c7dbf Remove duplicated //etc for make install
install -Dm644 scripts/beesd.conf.sample $(DESTDIR)/$(ETC_PREFIX)/bees/beesd.conf.sample
will  expand to //etc/bees/beesd.conf.sample. This patch removes the duplicated /
2021-10-31 10:41:56 +01:00
Zygo Blaxell
84adbaecf9 beesd: add missing RuntimeDirectory
Since we started locking down the beesd service, we no longer have
privileges to do some things.  Have systemd do it for us instead.

Fixes: #195
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-14 21:13:33 -04:00
Zygo Blaxell
12e80658a8 fs: fix FIEMAP_MAX_OFFSET type silliness in fiemap.h
In fiemap.h the members of struct fiemap are declared as __u64, but the
FIEMAP_MAX_OFFSET macro is an unsigned long long value:

	$ grep FIEMAP_MAX_OFFSET -r /usr/include/
	/usr/include/linux/fiemap.h:#define FIEMAP_MAX_OFFSET   (~0ULL)
	$ grep fe_length -r /usr/include/
	/usr/include/linux/fiemap.h:    __u64 fe_length;   /* length in bytes for this extent */

This results in a type mismatch error on architectures like ppc64le:

	fiemap.cc:31:35: note:   deduced conflicting types for parameter 'const _Tp' ('long unsigned int' and 'long long unsigned int')
	    31 |                 fm.fm_length = min(fm.fm_length, FIEMAP_MAX_OFFSET - fm.fm_start);
	       |                                ~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Work around this by copying the macro into a uint64_t constant,
and not using the macro any more.
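
A minimal sketch of the workaround, with hypothetical names
(example_max_offset, clamp_length).  Once both arguments are uint64_t,
template deduction for std::min no longer sees two different types:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical constant name; the value mirrors FIEMAP_MAX_OFFSET (~0ULL),
// but the type is fixed to uint64_t on every architecture.
static const uint64_t example_max_offset = ~uint64_t(0);

// Clamp a length so that start + length cannot exceed the maximum offset.
inline uint64_t clamp_length(uint64_t start, uint64_t length)
{
	return std::min(length, example_max_offset - start);
}
```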

Fixes: https://github.com/Zygo/bees/issues/194

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-06 15:17:02 -04:00
Zygo Blaxell
b436f8483b docs: add readahead_ event group
readahead and unreadahead have new event counters.  Document them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-04 20:44:25 -04:00
Zygo Blaxell
a353d8cc6e hash: use POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
The hash table is one of the few cases in bees where a non-trivial amount
of page cache memory will be used in a predictable way, so we can advise
the kernel about our IO demands in advance.

Use WILLNEED to prefetch hash table pages at startup.

Use DONTNEED to trigger writeback on hash table pages at shutdown.
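
For reference, a sketch of how such advice is issued.  advise_range is a
hypothetical wrapper; posix_fadvise itself is the POSIX call, and it
returns an error number directly instead of setting errno:

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/types.h>

// Hypothetical wrapper: returns true when the kernel accepted the advice.
// `advice` is a constant such as POSIX_FADV_WILLNEED or POSIX_FADV_DONTNEED.
inline bool advise_range(int fd, off_t offset, off_t length, int advice)
{
	return posix_fadvise(fd, offset, length, advice) == 0;
}
```

The advice is only a hint, so a caller would normally log a failure and
carry on rather than treat it as fatal.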

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-04 20:41:09 -04:00
Zygo Blaxell
97d70ef4c5 bees: readahead() in the kernel is posix_fadvise(..., POSIX_FADV_WILLNEED)
In theory, we don't need the pread() loop, because the kernel will do a
better job with readahead().

In practice, we might still need the pread() code, as the readahead will
occur at idle IO priority, which could adversely affect bees performance.

More testing is required.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-04 20:21:01 -04:00
Zygo Blaxell
a9cd19a5fe fs: avoid unaligned access when copying btrfs search headers
The assignment operator will use member-wise assignment, which
assumes the object's this pointer is aligned.  That doesn't
happen when the object in question is part of a btrfs search
result, and aarch64 faults over it.

Use memcpy instead, which has no alignment constraints.
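
A small sketch of the pattern, using a hypothetical ExampleHeader type in
place of the real search header:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record type standing in for a search result header.
struct ExampleHeader {
	uint64_t objectid;
	uint64_t offset;
	uint32_t type;
	uint32_t len;
};

// Copy a header out of a byte buffer at any position, aligned or not.
// Casting (buf + pos) to ExampleHeader* and assigning through it would
// assume alignment that the buffer does not guarantee.
inline ExampleHeader read_header(const uint8_t *buf, size_t pos)
{
	ExampleHeader rv;
	memcpy(&rv, buf + pos, sizeof(rv));
	return rv;
}
```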

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-10-04 20:19:00 -04:00
Jiahao XU
69c3d99552 Rm MOUNT_OPTIONS for it is of no use and dangerous
Btrfs mount options affect all mount points using the same Btrfs
partition, so specifying them per-mount is useless.

Also, common mount options like `noatime,nosuid,nodev,noexec` have little
to no effect on beesd, so it's just better and simpler to remove this.

Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
2021-10-04 20:19:00 -04:00
Jiahao XU
ccec63104c Update default MOUNT_OPTIONS beesd.in
`noatime` to avoid updating atime;
`nodev,noexec,nosuid` for the pedantic.
2021-10-04 20:19:00 -04:00
Jiahao XU
951b5ce360 Fix typo when setting default val of MOUNT_OPTIONS in beesd.in
Fixed mistake in #188
2021-10-04 20:18:55 -04:00
Jiahao XU
f2c65f2f4b Update comment in beesd@.service.in
Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
2021-09-04 21:20:05 +10:00
Jiahao XU
c79eb1d704 Further sandbox beesd using systemd.exec options
I've verified that with this setup, a user can access the log
in /run/bees, but cannot access the mounted filesystem.

Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
2021-09-04 17:40:13 +10:00
Zygo Blaxell
522e52618e context: calculate TOTAL RATES correctly
The denominator for TOTAL RATES is the total running time, not the delta
running time.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-08-30 18:23:42 -04:00
Jiahao XU
4a3d3e7a43 Modify systemd unit and beesd.in to use private mnt namespace
to:
 - avoid influencing the global mount namespace
 - auto umount upon exit of this unit

Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
2021-08-30 18:23:38 -04:00
Jiahao XU
13abf8aada Add new options MOUNT_OPTIONS
Signed-off-by: Jiahao XU <Jiahao_XU@outlook.com>
[trailing whitespace deleted]
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-08-30 18:22:30 -04:00
Kai Krakow
081a6af278 bees: Avoid unused result with -Werror=unused-result
Fixes: commit 20b8f8ae0b ("bees: use helper function for readahead")
Signed-off-by: Kai Krakow <kai@kaishome.de>
2021-06-19 10:35:28 +02:00
Zygo Blaxell
3d95460eb7 fiemap: don't force flush so we can see the delalloc shenanigans
Like filefrag, fiemap was defaulting to FIEMAP_FLAG_SYNC, and providing no
option to turn it off.  This prevents observation of delayed allocations,
making fiemap less useful.

Override the default flag setting so fiemap gets the current
(i.e. unflushed) extent map state.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 21:09:14 -04:00
Zygo Blaxell
d9e3c0070b context: stop creating new refs when there are too many already
LOGICAL_INO_V2 has a maximum limit of 655050 references per extent.
Although it no longer has a crippling performance problem, at roughly
two seconds to process an extent, it's too slow to be useful.

When an extent gains an absurd number of references, stop making any
more.  Returning zero extent refs will make bees believe the extent
was deleted, and it will remove the block from the hash table.

This helps speed processing of highly duplicated large files like
VM images, at the cost of a slightly lower dedupe hit rate.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 21:05:55 -04:00
Zygo Blaxell
955b8ae459 task: set the name of consumer threads so it is not "load_tracker"
The default name of a newly constructed thread is apparently the name
of the thread that created it.  That's very misleading when there are
a lot of TaskConsumer threads and they have nothing to do, so set the
name of each TaskConsumer thread as soon as it is created.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 21:02:00 -04:00
Zygo Blaxell
08899052ad trace: current_exception() is not a replacement for uncaught_exception()
In 15ab981d9e "bees: replace uncaught_exception(), deprecated in C++17",
uncaught_exception() was replaced with current_exception(); however,
current_exception() is only valid after an exception has been captured
by a catch block.

BeesTracer wants to know about exceptions _before_ they are caught,
so current_exception() is not useful here.

Instead, conditionally compile using uncaught_exception() or
uncaught_exceptions(), selected by C++ standard version, and make
bees stack traces work again.
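
The conditional compilation described above can be sketched roughly as follows. This is an illustration only, not the bees source: `exception_in_flight` and `UnwindProbe` are hypothetical names, and the sketch assumes the selection is made on `__cplusplus`.

```cpp
#include <cassert>
#include <exception>

// Hypothetical helper (not from bees): true while an exception has been
// thrown but has not yet reached its catch block.
inline bool exception_in_flight()
{
#if __cplusplus >= 201703L
	// C++17 deprecates uncaught_exception(); C++20 removes it.
	return std::uncaught_exceptions() > 0;
#else
	return std::uncaught_exception();
#endif
}

// Hypothetical tracer: its destructor records whether it is being
// destroyed by stack unwinding or by a normal scope exit.
struct UnwindProbe {
	bool &m_flag;
	explicit UnwindProbe(bool &flag) : m_flag(flag) {}
	~UnwindProbe() { m_flag = exception_in_flight(); }
};

// The probe is destroyed during unwinding, before the handler runs.
inline bool probe_during_throw()
{
	bool seen = false;
	try {
		UnwindProbe probe(seen);
		throw 42;
	} catch (...) {
		// current_exception() is only non-null from here on,
		// which is too late for the probe's destructor.
	}
	return seen;
}

// The probe is destroyed by leaving the scope normally.
inline bool probe_during_normal_exit()
{
	bool seen = true;
	{
		UnwindProbe probe(seen);
	}
	return seen;
}

// Usage sketch.
inline void unwind_probe_demo()
{
	assert(probe_during_throw());
	assert(!probe_during_normal_exit());
}
```

The probe's destructor is the only place that can observe the exception before it is caught, which is why a check based on current_exception() does not help there.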

Fixes: 15ab981d9e "bees: replace uncaught_exception(), deprecated in C++17"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
03532effed trace: move BeesTrace and BeesNote into their own translation unit
This allows these components to be used by test executables without
pulling in all of bees, and more rapidly iterate their code.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
6adaedeecd extentwalker: fix the binary search and add some debug infrastructure
Add some conditionally-compiled debug code, including an in-memory log
of what ExtentWalker does.  Dump that log on exceptions.

If we loop too many times in a debug build, kill the process so we can
stack trace.  In non-debug builds just throw a normal exception.

Grow the step size instead of shrinking it, to reduce the number of
binary search iterations.

Prevent a bug where the step size bottoms out before positioning the
target extent in the middle of the result vector.

Use the first extent for "first_extent", instead of the 3rd.

Get rid of some redundant checks.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
54f03a0297 extentwalker: fix missing characters
"C" in LOGICAL_INO, and avoid writing "flags=" in the log.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
52279656cf extentwalker: fix the hole position logic
When a file ends with a hole, ExtentWalker synthesizes a hole extent record
to cover the distance between the last ipos and EOF.  Unfortunately, ipos
was incremented by the number of items in the result vector instead.  Fix
that by incrementing by hole_extent.size().

While we're here, fix up some of the other data quality logic, including
a useless THROW_CHECK that was nothing but workarounds for earlier bugs.

Fixes: https://github.com/Zygo/bees/issues/26
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
1fd26a03b2 tracer: annotate both ends of the stack trace
Add a matching "--- BEGIN TRACE..." line to complement the "---  END
TRACE..." line.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
b083003cf7 docs: update kernel bugs table as of 5.12.3
Two new tree mod log bugs #5 and #6 (uncovered by the zoned IO work,
though #6 has been seen in the wild on 5.10.29).

Tweak the text of some of the workarounds.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
b2d4a07c6f roots: add a TRACE for transid_max search and crawl_transid thread
Some users are hitting an exception somewhere in crawl_transid, which
forces bees to return back to the transid_max calculation over and over.
Also out-of-range transids.

Add some BEESTRACE so we can see what we were doing in the exception
handler.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
7008c74113 bees: trace and log improvements during roots and context startup
Currently if crawl throws an exception, we don't have basic information
about what was being crawled or even if the crawler was running at all.

These traces also help identify the causes of early exception failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
5f0f7a8319 bees: increase StringFile size limit
If we are going to dedupe thousands of subvols, we are going to need a
bigger beescrawl.dat.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
ee86b585a5 bees: use a reserved symbol name in BEESLOG
"c" could be a local variable name, which would do interesting things
to some log messages.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
cf4b5417c9 context: remove unnecessary copies
These were added while debugging a crash that was fixed years ago.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
77ef6a0638 roots: split constructor into separate start method
This allows us to use the fd cache and inode resolve functions
without starting crawler threads.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
0f0da21198 context: track record extent reference counts
This might be interesting information, though most of the motivation for
this evaporated when kernel 5.7 came out.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
8a70bca011 bees: misc comment updates
These have been accumulating in unpublished bees commits.  Squash them all
into one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
20b8f8ae0b bees: use helper function for readahead
There seem to be multiple ways to do readahead in Linux, and only some
of them work.  Hopefully reading the actual data is one of them.

This is an attempt to avoid page-by-page reads in the generic dedupe code.
We load both extents into the VFS cache (read sequentially) and hope they
are still there by the time we call dedupe on them.

We also call readahead(2) and hopefully that either helps or does nothing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:54 -04:00
Zygo Blaxell
0afd2850f4 cache: emit log messages when clearing FD cache
This enables us to correlate FD cache clears with external events such
as btrfs inode eviction storms.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:56:46 -04:00
Zygo Blaxell
ffac407a9b roots: clean up crawl_master
Remove some broken #if 0 code, and take advantage of new Task
non-repeating execution semantics.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
4f032ab85b context: report Task instance count
Report the number of Task objects that currently exist as well as the number
on the global work queue.

	THREADS (work queue 298 of 2385 tasks, 16 workers):

This helps spot leaks, since Task objects that are blocked on other Task
post-exec queues are otherwise invisible.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
5f763f6d41 task: handle thread lifecycle more strictly
Testing sometimes crashes during exec of the first Task object, which
triggers construction of TaskConsumer threads.  Manage the life cycle
of the thread more strictly--don't access any methods of TaskConsumer
or std::thread until the constructor's caller's lock on TaskMaster
is released.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
0928362aab task: replace waiting state with run/exec counter
Task::run() would schedule a new execution of Task, unless it was waiting
on a queue for execution.  This cannot be implemented with a bool,
since a Task might be included in multiple queues, and should still be
in waiting state even when executed in that case.

Replace the bool with a counter.  run() and append() (but not
append_nolock) increment the counter, exec() decrements the counter.
If the counter is non-zero when run() or append() is called, the Task
is not scheduled.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
d5ff35eacf task: track number of Task objects in program and provide report
This is a simple lightweight counter that tracks the number of Task
objects that exist.  Useful for leak detection.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
b7f9ce3f08 task: serialize Task execution when Tasks block due to mutex contention
Quite often we want to execute task B after task A finishes executing,
especially if tasks A and B attempt to acquire locks on the same objects.

Implement that capability in Task directly:  each Task holds a queue
of Tasks which will be executed strictly after this Task has finished
executing, or if the Task is destroyed.

Add a local queue to each TaskConsumer.  This queue contains a list
of Tasks which are to be executed by a single thread in sequential
order.  These tasks are executed before fetching any tasks from
TaskMaster.

Each time a Task finishes executing, the list of tasks appended to the
recently executed Task are spliced at the beginning of the thread's
TaskConsumer local queue.  These tasks will be executed in the same
thread in the same order they were appended to the recently executed Task.

If a Task is destroyed with a post-execution queue, that queue is
also inserted at the front of the current TaskConsumer's local queue.

If a Task is destroyed or somehow executed outside of a TaskConsumer
thread, or a TaskConsumer thread is destroyed, the local queue of Tasks
is wrapped in a "rescue_task" Task, and spliced before the head of the
global queue.  This preserves the sequential ordering of tasks.

In all cases the order of sequential execution of Tasks that are
appended to another Task is preserved.

The unused queue insertion functions are removed.

Exclusion is now simply a mutex, a bool, and a Task with an empty
function.  Tasks that queue up waiting for the mutex are stored in
Exclusion's Task, and Exclusion simply runs that task when the
ExclusionState is released.
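
As a rough, single-threaded illustration of the post-execution queue described above (not the bees implementation): `SketchTask` and `consume_all` are hypothetical names, locking is omitted, and the "rescue_task" path is not modelled.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <memory>
#include <utility>

struct SketchTask;
using TaskPtr = std::shared_ptr<SketchTask>;

// Hypothetical stand-in for Task: a function plus a queue of followers.
struct SketchTask {
	std::function<void()> fn;
	std::list<TaskPtr> post_exec;	// run strictly after this task finishes

	explicit SketchTask(std::function<void()> f) : fn(std::move(f)) {}

	void append(const TaskPtr &next)
	{
		assert(next != nullptr);
		post_exec.push_back(next);
	}
};

// One consumer's loop: the local queue is drained before anything is
// fetched from the global queue.
inline void consume_all(std::list<TaskPtr> &global_queue)
{
	std::list<TaskPtr> local_queue;
	while (!local_queue.empty() || !global_queue.empty()) {
		std::list<TaskPtr> &source = local_queue.empty() ? global_queue : local_queue;
		TaskPtr task = source.front();
		source.pop_front();
		task->fn();
		// Splice the followers at the front of the local queue, so they
		// run next, in the order they were appended.
		local_queue.splice(local_queue.begin(), task->post_exec);
	}
}
```

With tasks A and D on the global queue, and B and C appended to A, this loop runs A, B, C, D: the followers of A jump ahead of D without being reordered among themselves.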

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
592580369e docs: btrfs-kernel: add the extent ref hash bug
Fixed in 5.11 and 5.10 but _not_ 5.10 or 5.4 (yet).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
0bbaddd54c docs: finally concede that the consensus spelling is "dedupe"
Change documentation and comments to use the word "dedupe," not "dedup"
as found in circa-3.15 kernel sources.

No changes in code or program output--if they used "dedup" before, they
will continue to be spelled "dedup" now.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
06a46e2736 chatter: add option to remove log level prefix
Some projects use only one log level, so there is no need to repeat it
for every line.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
45afce72e3 test: fd: note when bad cast exception is expected
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
e4c95d618a crucible: use '#include "crucible/...' everywhere
Make the #include syntax more consistent (even if it has no effect).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
7cffad5fc3 fd: make the close method on IOHandle private
Fd's cache does not handle changes in the state of its IOHandle parameter.
If we allow:

	Fd f;
	f->close();

then Fd ends up caching a pointer to a closed Fd, and will become very
badly confused if a new Fd appears with the same int identifier.

Fix by removing the close method.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
06062cfd96 pool: use weak_ptr to run destructor earlier
Drop the ListType alias because we only use it once.  Rename ListRep to
PoolRep to better reflect what it does.

We don't need the Pool to be available to handle destroyed Pool::Handle
objects.  A weak_ptr in the Handle would detect the Pool has been
destroyed, so we don't need to track that ourselves.  As a bonus, we can
destroy the PoolRep object as soon as the Pool has been destroyed, delayed
only if there is a Handle object currently executing its destructor.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
fbd1091052 options: remove default 8 CPU thread limit
Higher CPU core counts became more common, and kernel bugs became less
common, since the arbitrary 8-thread limit was introduced.  We can remove
the limit now, and treat any remaining scaling inefficiency as a bug to
be removed.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
032c740678 process: SIGCLD is not portable
MUSL libc doesn't have it, for instance.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
Zygo Blaxell
5b72f35657 src: bees depends on libcrucible.a
The dependency was missing, so changes to the library would not trigger
a rebuild of the bees binary.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-06-11 20:49:15 -04:00
SeerLite
3bf6db0354 install.md: Update Arch Linux instructions
bees is now available in the community repository.

Also changed AUR installation line to something more generic.
2021-06-11 13:21:41 -04:00
Zygo Blaxell
80c69f1ce4 context: get rid of shared_ptr<BeesContext> in every single cached Fd object
Support for multiple BeesContext objects sharing a FdCache was wasting
significant space and atomic inc/dec memory cycles for no good reason
since the shared-FdCache feature was deprecated.

open_root and open_root_ino still need a BeesContext to work.  Pass the
BeesContext pointer through the function object instead of the cache
key arguments.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-28 21:54:00 -04:00
Zygo Blaxell
db65031c2b context: get rid of all instances of pthread_cancel
pthread_cancel doesn't really work properly.  It was only being used in
bees to bring threads to a stop if the BeesContext is destroyed early.
It is frequently implicated in core dump reports because of the fragility
of the C++ iostream / C stdio / library infrastructure, particularly
surrounding upgrades on the host running bees.  The pthread_cancel call
itself often simply fails even when it doesn't call terminate().

Defer creation of the status and progress threads until after the
BeesContext::start method is invoked.  At that point, the existing
ask-threads-nicely-to-stop code is up and running, and normal condvars
can be used to bring bees to a stop, without having to resort to
pthread_cancel.

Since we're deleting half of the BeesContext constructor in this change,
let's remove the other half too, and put an end to the deprecated support
for multiple BeesContexts sharing a process.  It's still possible to run
multiple BeesContexts, but they will not share a FD cache.  This will
allow the FD cache's keys to become smaller and hopefully save some
memory later on.

Fixes: #171

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-28 21:42:03 -04:00
Zygo Blaxell
243480b515 ntoa: fix comment disparaging gcc for not implementing C99 compound literals in C++
C99's "{ 0 }" notation for filling in a struct with all zeros was not
included in the C++11 standard, so gcc doesn't implement it and neither
does clang.

gcc does (did?) have issues with warnings on the same code in C99,
complaining about uninitialized struct members when "{0}" explicitly
initializes every member to a zero value.  These issues don't apply in
the C++ code where NTOA_TABLE_ENTRY_END is used.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-23 08:20:03 -04:00
Zygo Blaxell
f8a8704135 ntoa: fix bits_ntoa formatting and error handling
Get rid of an assert in bits_ntoa.  Throw an exception instead.

Fix hex formatting (adding "0x" before a decimal number is not
the correct way to format hex strings).
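
A minimal sketch of the formatting mistake and one possible fix. `hex_wrong` and `hex_right` are hypothetical helpers written for this note, not the bits_ntoa code.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// The buggy pattern: a "0x" prefix in front of decimal digits.
inline std::string hex_wrong(uint64_t n)
{
	return "0x" + std::to_string(n);
}

// One way to do it properly: switch the stream to hexadecimal output.
inline std::string hex_right(uint64_t n)
{
	std::ostringstream oss;
	oss << "0x" << std::hex << n;
	return oss.str();
}

// Usage sketch.
inline void hex_demo()
{
	assert(hex_wrong(255) == "0x255");	// a reader would parse this as 597
	assert(hex_right(255) == "0xff");
}
```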

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-23 08:20:03 -04:00
Zygo Blaxell
8a60850e32 docs: note that FIEMAP is also affected by backref performance issue
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-23 08:20:03 -04:00
Zygo Blaxell
9d21e6b456 docs: drop incomplete build recipe for ubuntu 14.04
The kernel from such an old distro version likely has several unfixed
bugs.  Better not to support it at all.

Users who can upgrade the kernel are probably also sophisticated enough
to fix the build issues too.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-23 08:20:03 -04:00
Zygo Blaxell
bcf3e7de3e uuid: drop dependency on uuid.h
The weird things distros do to the path where uuid.h gets installed
have broken bees builds for the last time.

We were only using uuid to support a legacy feature that was removed
over four years ago.

Hypothetical users who are upgrading directly from bees v0.1 should
probably restart all the crawlers anyway--there were bugs.  Also, if any
such users exist, I respect their tremendous patience with the horrible
performance all these years--bees got about 30x faster since v0.1.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-23 08:16:50 -04:00
Zygo Blaxell
6465a7c37c docs: btrfs-kernel: update recommended kernels list, slow backrefs bug has been backported
The slow backrefs performance improvement is confirmed by reports from
multiple users:

	* Me (5.4.60 + backref patches, 5.7 to 5.11)

	* https://github.com/Zygo/bees/issues/161 (5.8)

	* https://github.com/Zygo/bees/issues/162 (5.8)

	* IRC user S0rin (5.4.88 + backref patches)

The issue still exists, but at a significantly reduced scale:  now about
2 ms of CPU per ref on a fast machine.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-04-04 14:01:55 -04:00
Zygo Blaxell
177f393ed6 docs: btrfs-kernel: add the 5.10 performance regression, the Ctrl-C on balance kernel crash has been fixed
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-02-23 17:37:51 -05:00
Zygo Blaxell
5f40f9edb0 docs: remove libbtrfs-dev as a build-time dependency
We no longer require ctree.h from libbtrfs-dev.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-02-22 20:07:06 -05:00
Zygo Blaxell
7f660f50b8 lib: fs: stop using libbtrfs-dev helper functions to re-enable buffer length checks
The Linux kernel's btrfs headers are better than the libbtrfs-dev headers:

	- the libbtrfs-dev headers have C++ language compatibility issues

	- upstream version in Linux kernel is more accurate and up to date

	- macros in libbtrfs-dev's ctree.h hide information that would
	enable bees to perform runtime buffer length checking

	- enum types whose presence cannot be detected with #ifdef

When accessing members of metadata items from the filesystem, we want
to verify that the member we are accessing is within the boundaries of
the item that was retrieved; otherwise, a memory access violation may
occur or garbage may be returned to the caller.  A simple C++ template,
given a pointer to a structure member and a buffer, can determine that
the buffer contains enough bytes to safely access a struct member.
This was implemented back in 2016, but left unused due to ctree.h issues.

Some btrfs metadata structures have variable length despite using a
fixed-size in-memory structure.  The members that appear earliest in
the structure contain information about which following members of the
structure are used.  The item stored in the filesystem is truncated after
the last used member, and all following members must not be accessed.

'btrfs_stack_*' accessor macros obscure the memory boundaries of the
members they access, which makes it impossible for a C++ template to
verify the memory access.  If the template checks the length of the
entire structure, it will find an access violation for variable-length
metadata items because the item is rarely large enough for the entire
structure.

Get rid of all the libbtrfs-dev accessor macros and reimplement them
with the necessary buffer length checks.
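
The kind of per-member length check described above might look roughly like this. It is a sketch under assumptions: `SketchItem`, `sketch_get_at` and `SKETCH_GET` are invented for illustration, and the layout is not a real btrfs item.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical item layout: the last member may be truncated away on disk.
struct SketchItem {
	uint64_t flags;
	uint64_t always_present;
	uint64_t only_if_flagged;
};

// Copy a value of type M out of buf at the given byte offset, or throw
// if the member does not fit inside the buffer.
template <class M>
M sketch_get_at(const std::vector<uint8_t> &buf, size_t offset)
{
	if (offset > buf.size() || sizeof(M) > buf.size() - offset) {
		throw std::out_of_range("member extends past end of item");
	}
	M rv;
	std::memcpy(&rv, buf.data() + offset, sizeof(M));
	return rv;
}

// Check only the member being read, never sizeof(S) as a whole.
#define SKETCH_GET(buf, S, member) \
	sketch_get_at<decltype(S::member)>((buf), offsetof(S, member))

// Usage sketch.
inline void sketch_get_demo()
{
	const SketchItem item{1, 7, 9};
	std::vector<uint8_t> full(sizeof(item));
	std::memcpy(full.data(), &item, sizeof(item));
	// Simulate an item that was truncated after its last used member.
	std::vector<uint8_t> truncated(full.begin(),
		full.begin() + offsetof(SketchItem, only_if_flagged));
	assert(SKETCH_GET(truncated, SketchItem, always_present) == 7);
	assert(SKETCH_GET(full, SketchItem, only_if_flagged) == 9);
}
```

Reading `only_if_flagged` from the truncated buffer throws instead of returning garbage, while a check against `sizeof(SketchItem)` would have rejected the valid read of `always_present` as well.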

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-02-22 20:06:43 -05:00
Zygo Blaxell
6eb7afa65c build: include localconf everywhere
Overriding makeflags did not work from localconf in the src, lib, or
test directories.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2021-02-22 20:06:43 -05:00
Zygo Blaxell
10af3f9763 bees: remove si_addr_lsb from siginfo debug message to fix FTBFS
Apparently it is missing in newer Linux headers, making
builds fail.  We don't need it, so remove it.

Closes: #160
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-21 19:26:22 -05:00
Zygo Blaxell
636e69267e resolve: add bees.h constants for balance and logical_ino serialization
Make these workarounds configurable in src/bees.h instead of #if 0
code blocks.  Someday we'll make the constants in bees.h configurable
through a file or similar.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 18:07:36 -05:00
Zygo Blaxell
c0149d72b7 fs: use Spanner to refer to ioctl arg buffer instead of making vector copies
This avoids some allocations and copying.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 18:07:36 -05:00
Zygo Blaxell
333aea9822 lib: introduce Spanner, a pointer and size delimiting a range
Spanner<Iterator> turns a pair of pointers into a sequence container
with several of vector's methods.

A partial specialization of make_spanner is provided which uses
shared_ptr as the beginning of the range.  Some of the Spanner code
is a questionable hack in support of this.

C++20 has ranges and span, but neither is worth moving the minimum
C++ standard forward.
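
For readers unfamiliar with the idea, a pointer-and-size range with a few vector-like methods could be sketched as below. `SketchSpan` is a hypothetical simplification; it does not reproduce Spanner's interface or its shared_ptr specialization.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical non-owning view over [begin, begin + count).
template <class T>
class SketchSpan {
	T *m_begin;
	T *m_end;
public:
	SketchSpan(T *begin, size_t count) : m_begin(begin), m_end(begin + count) {}

	T *begin() const { return m_begin; }
	T *end() const { return m_end; }
	T *data() const { return m_begin; }
	size_t size() const { return static_cast<size_t>(m_end - m_begin); }
	bool empty() const { return m_begin == m_end; }

	// Unchecked access, like vector::operator[].
	T &operator[](size_t i) const
	{
		assert(i < size());
		return m_begin[i];
	}

	// Checked access, like vector::at.
	T &at(size_t i) const
	{
		if (i >= size()) throw std::out_of_range("SketchSpan::at");
		return m_begin[i];
	}
};
```

Because the view does not own or copy the bytes, writes through it land in the underlying buffer, and it cannot be resized.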

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 18:07:36 -05:00
Zygo Blaxell
9ca69bb7ff fs: remove buffer overrun check in get_struct_ptr for non-copying containers
When we are using non-copying containers, we can't call resize() on them.
get_struct_ptr is essentially a pointer cast, so we will end up with a
pointer to a struct that extends beyond the boundaries of the container.

As long as the btrfs metadata is not corrupted, we should not have too
many problems.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 18:07:36 -05:00
Zygo Blaxell
f45e379802 fs: deprecate vector<char>
Use uint8_t when we mean uint8_t, i.e. vector<uint8_t> instead of
vector<char>.

Add a template parameter instead of vector so we can swap in a
non-copying data type.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 18:07:36 -05:00
Zygo Blaxell
180bb60cde fs: add support and workarounds for btrfs fs_info v2
Define a local copy of the header that has fields for the csum type
and length, so we can build in places that haven't caught up to kernel
5.5 headers yet.

The reason why the csum type and length are not unconditionally filled
in eludes me.  csum_length is necessarily non-zero, and the cost of
the conditional is worse than the cost of the copy, so the whole flags
dance is a WTF...but it's part of the kernel API now, so it's too late
to NAK it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 18:07:36 -05:00
Zygo Blaxell
c80af1cb4f fd: deprecate Resource in favor of NamedPtr
Rewrite Fd using a much simpler named resource template class with
a more straightforward derivation strategy.

Behavior change:  we no longer throw an exception while calling get_fd()
on a closed Fd.  This does not seem to bother any current callers except
for the tests.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 18:06:44 -05:00
Zygo Blaxell
ab5316a3da src: use correct flags for compiling .c files, fix missing dependencies
fiewalk and fiemap depend on a lot of crucible, and incremental builds
fail hard without proper dependency tracking.

All binaries must be rebuilt when makeflags changes.  This dependency
exists already in lib and test, but src was missing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:52 -05:00
Zygo Blaxell
d2ecb4c9ea lib: namedptr: thread-safe reference counted named object store
NamedPtr provides reference-counted handles to named objects.  The object
is created the first time the associated name is used, and stored under
the associated name until the last handle is destroyed.  NamedPtr may
itself be destroyed while handles are still active.

This template is intended to replace ResourceHandle with a more general
and less invasive implementation.
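
The behaviour described above can be approximated with a map of weak_ptr values. This is a sketch only: `SketchNamedStore` is a hypothetical name, the locking that makes the real class thread-safe is omitted, and expired map entries are only overwritten on reuse rather than removed when the last handle dies.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

template <class T>
class SketchNamedStore {
	using Ptr = std::shared_ptr<T>;
	std::function<Ptr(const std::string &)> m_make;
	std::map<std::string, std::weak_ptr<T>> m_map;
public:
	explicit SketchNamedStore(std::function<Ptr(const std::string &)> make)
		: m_make(std::move(make)) {}

	// Return the live object for this name, creating it on first use or
	// after the last handle to the previous one was destroyed.
	Ptr operator()(const std::string &name)
	{
		auto found = m_map.find(name);
		if (found != m_map.end()) {
			if (Ptr live = found->second.lock()) return live;
		}
		Ptr made = m_make(name);
		assert(made);
		m_map[name] = made;
		return made;
	}
};
```

Since the handles are ordinary shared_ptr objects and the store keeps only weak references, the store can be destroyed while handles are still in use.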

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:52 -05:00
Zygo Blaxell
8a2fb75462 fd: move relative path string to library
Use a single static variable located in the library, instead of
having a separate one for each compilation unit.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:52 -05:00
Zygo Blaxell
420c218c83 cache: remove unused #includes
Also fix bees-roots's missing headers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:52 -05:00
Zygo Blaxell
6ee5da7d77 cache: clean up pointer mangling and duplicate code
std::list and std::map both have stable iterators, and list has the
splice() method, so we don't need a hand-rolled double-linked list here.

Coalesce insert() and operator() into a single function.

Drop the unused prune() method.

Move destructor calls for cached objects out from under the cache lock.
Closing a lot of files at once is already expensive, might as well not
stop the world while we do it.
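
A minimal LRU sketch built on std::list and std::map, to show why stable iterators and splice() make a hand-rolled linked list unnecessary. `SketchLRU` is a hypothetical class; it has no locking and does not model moving destructor calls out from under a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <map>
#include <utility>

template <class K, class V>
class SketchLRU {
	using List = std::list<std::pair<K, V>>;
	List m_list;					// front = most recently used
	std::map<K, typename List::iterator> m_map;	// key -> stable list iterator
	size_t m_max;
	std::function<V(const K &)> m_fn;
public:
	SketchLRU(size_t max_items, std::function<V(const K &)> fn)
		: m_max(max_items), m_fn(std::move(fn))
	{
		assert(m_max > 0);
	}

	// Lookup and insert in a single function.
	V operator()(const K &key)
	{
		auto found = m_map.find(key);
		if (found != m_map.end()) {
			// Hit: relink the node at the front. The iterator stored in
			// the map stays valid because splice() moves no elements.
			m_list.splice(m_list.begin(), m_list, found->second);
			return found->second->second;
		}
		// Miss: compute the value, then evict the least recently used.
		m_list.emplace_front(key, m_fn(key));
		m_map[key] = m_list.begin();
		if (m_list.size() > m_max) {
			m_map.erase(m_list.back().first);
			m_list.pop_back();
		}
		return m_list.front().second;
	}

	size_t size() const { return m_list.size(); }
};
```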

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
b1bdd9e056 test: rebuild the tests if libcrucible.a changes
Due to a missing dependency, tests are not rebuilt when the library
changes, so tests return false results after library source changes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
0c84302d9a lib: don't rebuild libcrucible unless there is a version change
If we create an identical .version.cc then don't bother keeping it.
This prevents libcrucible from rebuilding if there are no other changes,
which in turn prevents all the binaries from rebuilding unconditionally.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
03627503ec include: #undef crc32c
Some versions of linux-libc header files define a macro named 'crc32c'.
We want to use that name too, so #undef it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
5248985da0 context: fix shutdown log messages identifying the wrong thread
We are waiting for the status thread, not the progress thread.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
d1f1c386bc tempfile: remove size limit in realign()
Now that tempfiles are using pool checkin functions to control their
size, we don't need a size limit in realign().

We keep the limit in make_copy because it's a sanity check against
letting a multi-terabyte copy operation slip through.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
6705cd9c26 context: move TempFile from TLS to Pool and fix some FdCache issues
Get rid of the thread-local TempFiles and use Pool instead.  This
eliminates a potential FD leak when the loadavg governor repeatedly
creates and destroys threads.

With the old per-thread TempFiles, we were guaranteed to have exclusive
ownership of the TempFile object within the current thread.  Pool is
somewhat stricter:  it only guarantees ownership while the checked-out
Handle exists.  Adjust the users of TempFile objects to ensure they hold
the Handle object until they are finished using the TempFile.

It appears that maintaining large, heavily-reflinked, long-lived temporary
files costs more than truncating after every use: btrfs has to write
multiple references to the temporary file's extents, then some commits
later, remove references as the temporary file is deleted or truncated.
Using the temporary file in a dedupe operation flushes the data to disk,
so nothing is saved by pretending that there is writeback pipelining and
trying to avoid flushes in truncate.  Pool provides usage tracking and
a checkin callback, so use it to truncate the temporary file immediately
after every use.

Redesign TempFile so that every instance creates exactly one Fd which
persists over the lifetime of the TempFile object.  Provide a reset()
method which resets the file back to the initial state and call it from
the Pool checkin callback.  This makes TempFile's lifetime equivalent to
its Fd's lifetime, which simplifies interactions with FdCache and Roots.

This change means we can now blacklist temporary files without having
an effective memory leak, so do that.  We also now have a reason to
remove something from the blacklist, so add a method for that too.

In order to move to extent-centric addressing, we need to be able to
reliably open temporary files by root and inode number.  Previously we
would place TempFile fd's into the cache with insert_root_ino, but the
cache would be cleared periodically, and it would not be possible to
reopen temporary files after that happened.  Now that the TempFile's
lifetime is the same as the TempFile Fd's lifetime, we can have TempFile
manage a separate FileId -> Fd map in Roots which is unaffected by the
periodic cache clearing.  BeesRoots::open_root_ino_nocache will check
this map before attempting to open the file via btrfs root+ino lookup,
and return it through the cache as if Roots had opened the file via btrfs.

Hold a reference to BeesRoots in BeesTempFile because the usual way
to get such a reference now throws an exception in BeesTempFile's
destructor.

These changes make method BeesTempFile::create() and all methods named
insert_root_ino unnecessary, so delete them.

We construct and destroy TempFiles much less often now, so make their
constructor and destructor more informative.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
3ce00a5ebe lib: introduce Pool, a class for storing reusable anonymous objects
Pool is a place to store shared_ptrs to generated objects (T) that are
too expensive to create and destroy between individual uses, such as
temporary files.  Objects in a Pool have no distinct identity
(contrast with Cache or NamedPtr).

Users of the Pool invoke the Pool function call overload and "check out"
a shared_ptr<T> for a T object from the Pool.  When the last referencing
shared_ptr<T> is destroyed, the T object is "checked in" to the Pool.

Each call of the Pool function overload checks out a shared_ptr<T> to a T
object that is not currently referenced by any other public shared_ptr<T>.

If there are no existing T objects in the Pool, a new T is constructed
by calling the generator function.

The clear() method destroys all checked in T objects owned by the Pool
at the time the method is called.  T objects that are checked out are
not affected by clear(), and they will be stored in the Pool when they
are checked in.

If the checkout function is provided, it is called on a shared_ptr<T>
during checkout, before returning to the caller.

If the checkin function is provided, it is called on a shared_ptr<T>
before returning it to the Pool.  The checkin function must not throw
exceptions.

The Pool may be destroyed while T objects are checked out of the Pool.
In that case, when the T objects are checked in, the T object is
immediately destroyed without calling the checkin function.
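
The checkout and checkin cycle described above can be sketched with a custom shared_ptr deleter. This is an illustration under assumptions, not the library class: `SketchPool`, `Rep` and `idle_count` are hypothetical names, locking and the checkout callback are omitted, and the weak_ptr used to detect a destroyed pool is one possible design.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <utility>

template <class T>
class SketchPool {
	using Ptr = std::shared_ptr<T>;
	struct Rep {
		std::list<Ptr> idle;			// checked-in objects
		std::function<void(Ptr)> checkin;	// optional; must not throw
	};
	std::shared_ptr<Rep> m_rep = std::make_shared<Rep>();
	std::function<Ptr()> m_generator;
public:
	explicit SketchPool(std::function<Ptr()> generator,
	                    std::function<void(Ptr)> checkin = nullptr)
		: m_generator(std::move(generator))
	{
		m_rep->checkin = std::move(checkin);
	}

	// Check out an object that no other public handle refers to.
	Ptr operator()()
	{
		Ptr obj;
		if (m_rep->idle.empty()) {
			obj = m_generator();
		} else {
			obj = m_rep->idle.front();
			m_rep->idle.pop_front();
		}
		assert(obj);
		std::weak_ptr<Rep> weak_rep = m_rep;
		// The public handle points at the same T; when the last copy of
		// the handle dies, its deleter performs the checkin.
		return Ptr(obj.get(), [obj, weak_rep](T *) {
			if (auto rep = weak_rep.lock()) {
				if (rep->checkin) rep->checkin(obj);
				rep->idle.push_back(obj);
			}
			// Pool already destroyed: obj is released with this deleter
			// and the checkin function is not called.
		});
	}

	// Destroy only the objects that are currently checked in.
	void clear() { m_rep->idle.clear(); }

	size_t idle_count() const { return m_rep->idle.size(); }
};
```

Holding only a weak reference to the pool's internals in each handle means the pool does not need to outlive its handles or track them.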

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
1086900a9d string: second argument to stoull is technically a nullptr
This comes up if too many compiler warnings are enabled.
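
For illustration only, a small sketch of the call shape involved. The helper name parse_u64 is hypothetical; the point is that the second parameter of std::stoull is a pointer, so nullptr is the appropriate "don't care" value rather than a literal 0 (which warnings such as GCC's -Wzero-as-null-pointer-constant flag).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: parse an unsigned integer, letting base 0
// auto-detect decimal, octal ("010") and hexadecimal ("0x10") input.
inline uint64_t parse_u64(const std::string &s)
{
    // The second argument is a size_t* out-parameter for the number of
    // characters consumed; pass nullptr when that count is not needed.
    return std::stoull(s, nullptr, 0);
}
```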

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
97c167d63a fs: don't zero-fill btrfs data containers
The kernel does it already, and we gain a little performance here because
we do it so often.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
b49d458792 context: move prealloc dedupe to a separate Task
A prealloc extent reference can be deduped immediately and asynchronously.
There is no need to slow down extent scanning to do it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
1e7dbc6f97 tempfile: remove old comments about fsync and deadlock bugs
I was never able to prove a connection between fsync() and deadlock bugs.
There were too many deadlock bugs to be able to isolate a bug that is
triggered specifically by fsync.

Update the comment (which has been unchanged since kernel 4.14).  We still
may want to do fsync() on temporary files someday, but there's a full
internal API rewrite between here and there.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
459071597b fs: make operator<() for search ioctl inline
Perf blames this operator for >1% of instructions with -O2, and
70% of instructions without -O2.

Let the compiler inline the function.
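
A sketch of what an inlinable comparison might look like, assuming the operator orders keys by (objectid, type, offset). SketchSearchHeader is a hypothetical stand-in, not the crucible type, and the field order is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for a search result header.
struct SketchSearchHeader {
    uint64_t objectid;
    uint32_t type;
    uint64_t offset;
};

// Defined inline in the header so the compiler can expand the
// comparison at each call site instead of emitting a function call.
inline bool operator<(const SketchSearchHeader &a, const SketchSearchHeader &b)
{
    return std::tie(a.objectid, a.type, a.offset)
         < std::tie(b.objectid, b.type, b.offset);
}
```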

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
187d12fc25 fs: always use container's actual size not requested size
The requested size may not match the final size of the container,
so consistently use the container's size after prepare(), not the
requested size.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
de6282c6cd roots: separate crawl sizes into bytes and items
Number of items should be low enough that we don't have too many stale
items, but high enough to amortize system call overhead to a reasonable
ratio.

Number of bytes should be constant:  one worst-case metadata page (the
btrfs limit is 64K, though 16K is much more common) so that we always
have enough space for one worst-case item; otherwise, we get EOVERFLOW
if we set the number of items too low and there's a big item in the tree,
and we can't make further progress.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
d332616eff roots: report the search parameters on tree search ioctl error
There are lots of ways the search can fail, but it's hard to pick one
without knowing the parameters.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
e654e29f45 bees: move usage message out of source file and fix a few inaccuracies
It's a pain to read, edit, and format large blocks of text in C++ code,
so rip the usage message out of bees.cc and put it in a plain text file.
Use a minimal translator to convert it into a C string.

While we're here, remove the multiple roots feature from the command
line synopsis, as we don't really support it any more.  Also clarify
that "id 5" is "subvol id 5", and describe in one sentence what
workaround-btrfs-send does.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
bbaf55b2b0 roots: make it build with clang
Remove an unnecessary cast that was breaking namespace lookup for clang.

Closes: #159

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
62a20ebf9c chatter: make it build with clang
Silence the unused variable warning.  The compiler is correct, but we
may implement line-level debug at some point in the future, so we
want to keep the member and parameters.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
7ec19d1eff clang: fix struct/class declaration/definition mismatches
clang does not like a defined class to be declared as a struct.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
b7b18d9fa1 extentwalker: make it build with clang
Remove unused MAX_OFFSET.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
f263c8751e bees context: make it build with clang
Remove unused function getenv_or_die.  All of our environment variable
parameters are optional or have default values.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
17d8759011 bees: make it build with clang
Remove unused "addr check" functions.  We have ranged_cast for detecting
overflow bits.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
e34237886d process: make it build with clang
Get rid of unused template instantiation.

Drop the unused realtime signals from the ntoa table.  If in the future
we really need to solve clang's issue with them, we'll address it then.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
29b051b131 task: make it build with clang
Remove unused closure captures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
8e9b53b3fd stats: remove nonsense dedup_unique_bytes stat
A long time ago, when bees used dedicated threads to scan each subvol, the
calculation of the "dedup_unique_bytes" statistic was still wrong.

This stat can only be calculated when dedupe runs on extent data items
instead of extent reference items.  Remove the stat variable until
that happens.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-12-17 17:54:51 -05:00
Zygo Blaxell
1b9b437c11 docs: btrfs-kernel: 4.20 adds 32-bit single convert bug, tree mod log issue #4
There was a 4th tree mod log crash that showed up in testing.  It can
be reproduced or eliminated by applying or reverting d2311e698578
("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
to a 5.4.x kernel before 5.4.54.

Unfortunately, the test can only run if several other patches that
fixed other bugs in d2311e698578 are applied or removed at the same time.
Commit d2311e698578 introduces a bug which destroys filesystems under test
long before tree mod log failures can be reproduced in testing.  One of
those patches also fixes tree mod log issue #4.  I do not know which one,
but since kernels after 5.1 cannot run without all of those patches, I do
not think it matters.

Tree mod issue #4 is the reason why the tree mod workaround is still
required on all kernels before 5.4.  The issue still exists on older
LTS kernels, e.g. 4.9.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-10-09 21:25:23 -04:00
Zygo Blaxell
217f5c781b docs: expand the tree mod log issues
The fixes appear inconsistently in stable/LTS kernels, so they can't be
mashed into a single row.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-10-09 17:26:57 -04:00
Zygo Blaxell
dceea2ebbc docs: improve send workaround text, add references to backref commits, make grammar more good now
Rewrite the text related to 'btrfs send' to clarify that the send
workaround is no longer necessary to avoid kernel crashes, but still
useful because send and dedupe still do not work at the same time.

Replace "many backref code changes" with a specific commit reference,
and improve the grammar of some issue descriptions.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-10-09 16:25:29 -04:00
Zygo Blaxell
bb8b6d6c50 docs: fix table formatting for kernel bugs list
Apparently there's Github Flavored Markdown, and there's the markup
language that github uses, and they are distinct things.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-10-09 12:52:47 -04:00
Zygo Blaxell
6843846b97 docs: update kernel bug tracking for October 2020
Present known kernel bugs in table form with issue descriptions,
fixed and broken kernel versions, and references to fixes.

Update kernel version recommendations to include information on kernel
versions up to 5.8.14.

Reduce emphasis on data corruption bugs which are 1) two or more
years old now, and 2) much less bad than the bugs in kernel 5.1.

Add deprecation warning for kernels before 4.15.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-10-09 12:24:14 -04:00
Zygo Blaxell
d040bde2c9 docs: use Github Flavored Markdown with table extension
Prefer to use cmark-gfm with extension 'table' so we can use tables in
locally-generated HTML files.  If cmark-gfm is not installed then
fall back to some other Markdown implementation, but the tables will
be broken on every other implementation I have tried so far.

Also make the HTML output depend on the Makefile, since there may be
document translation options specified there (like '-e table' or an
entirely different Markdown implementation).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-10-09 12:24:14 -04:00
Zygo Blaxell
15ab981d9e bees: replace uncaught_exception(), deprecated in C++17
uncaught_exception() had only the one valid use case, and it can be
reimplemented by literally calling current_exception() instead.

current_exception() has several valid use cases, so it is not likely
to be deprecated any time soon.
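
A hedged sketch of the replacement idea. The helper names are hypothetical, and the sketch assumes the check runs while a catch handler is active, which this message does not spell out. Note that std::current_exception() reports an exception only once a handler has been entered, so the two calls are not interchangeable in general.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical helper: true while an exception is being handled.
inline bool exception_active()
{
    return std::current_exception() != nullptr;
}

// Probe the helper from inside a catch handler.
inline bool probe_inside_handler()
{
    try {
        throw std::runtime_error("example");
    } catch (...) {
        return exception_active();
    }
    return false;
}
```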

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-10-09 12:07:10 -04:00
Zygo Blaxell
05bd65444d bees: initialize context in the correct order
We cannot use BeesContext::roots() until after
BeesContext::set_root_path() has been called.
Save up the parameter settings until then.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2020-08-31 22:35:17 -04:00
Vladimir Panteleev
2427dd370e scripts: Remove beescrawl.dat with -f
Avoid interactive prompts due to e.g. bad file modes.
2020-06-17 06:59:43 +00:00
Vladimir Panteleev
8aa343cecb scripts: Update beescrawl.dat file name after UUID removal
Commit 06e111c229 removed the UUID from
the beescrawl.dat file name, but this change was not also applied to
the wrapper script. Do that now.
2020-06-15 15:08:15 +00:00
Andrey Brusnik
9514b89405 fs: Change array syntax to pointer syntax
Fixes #141
Signed-off-by: Andrey Brusnik <pixeliz3d@protonmail.com>
2020-04-09 22:09:20 +03:00
Zygo Blaxell
07e5e7bd1b docs: update known kernel bugs list
"Storm of softlockups" starts with a simple BUG_ON, but after the
BUG_ON, all cores that are waiting on spinlocks get stuck.
The _first_ kernel call trace is required to identify the bug.
At least two such bugs have been identified.

Add some notes about the conflict between LOGICAL_INO and balance,
and the recently added bees workaround.

Update the gotchas page for balances to point to the kernel bugs page.
Remove "bees and the full balance will both work correctly" as that
statement is not true.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-11-28 00:17:10 -05:00
Zygo Blaxell
c4f0e4abee context: workaround to prevent LOGICAL_INO and btrfs balance from running concurrently
This avoids some kernel bugs.  One of them is fixed in 5.3.4 and later:

	efad8a853a "Btrfs: fix use-after-free when using the tree modification log"

There are apparently others in current kernels, so for now just put bees
on pause until the balance is done.

At some point we may want to provide an option to disable this
workaround; however, running bees and balance at the same time makes
neither particularly fast, so maybe we'll just leave it this way.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-11-28 00:13:15 -05:00
Kai Krakow
44f446e17e bees-context: Remove confusing log message
Saying just "This feature" at some log levels could be puzzling. Let's
remove this message; the feature has worked without problems for a year.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2019-11-06 09:04:52 +01:00
Zygo Blaxell
4363463342 process: Fix gettid() ambiguity with glibc >= 2.30
In version 2.30 glibc added its own gettid() function. This resulted in
"error: call of overloaded ‘gettid()’ is ambiguous" because gettid()
now exists in both namespace crucible and std.

For now, use explicit references to namespace crucible.  This continues
to work with new and old libc without having to test specific library
versions.

At some point, glibc gettid() will be deployed widely enough that we can
remove the crucible version entirely.
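
A Linux-only sketch of the explicit qualification, not the actual crucible source. The wrapper body and the function current_thread_id are assumptions; only the crucible::gettid name comes from the text above.

```cpp
#include <cassert>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

namespace crucible {
    // Wrapper around the raw system call, for C libraries that do not
    // provide gettid() themselves.
    inline pid_t gettid()
    {
        return static_cast<pid_t>(syscall(SYS_gettid));
    }
}

using namespace crucible;

inline pid_t current_thread_id()
{
    // An unqualified gettid() here is ambiguous when the C library also
    // declares one; the qualified name works with old and new libc.
    return crucible::gettid();
}
```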

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-10-30 00:12:33 -04:00
Zygo Blaxell
7117cb40c5 hash: prepare for user-selectable hash functions
Localize the hash function in bees to a single spot to make it easier
to change later (or at runtime).

Remove some code that was using a property of CRC as an optimization.
The optimization doesn't work for other hash functions, and running the
CRC function takes more CPU time than the optimization saved.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:06 -04:00
Zygo Blaxell
b3a8fcb553 lib: add cityhash function
CityHash64 appears to be the fastest available block hashing algorithm
that is good enough for dedupe.  It takes much less CPU than the CRC64
function, and avoids hash-collision problems with file formats that use
CRC64 as an integrity check on 4K block boundaries.

Extracted from git://github.com/google/cityhash with the "CRC" hash
functions (which require Intel/AMD CPU support) removed.  We don't
need those, and they introduce a new (if only theoretical) build-time
dependency.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:06 -04:00
Zygo Blaxell
228747a8f8 lib: fix non-local lambda expression cannot have a capture-default
We got away with this because GCC 4.8 (and apparently every GCC prior
to 9) didn't notice or care, and because there is nothing referenced
inside the lambda function body that isn't accessible from any other
kind of function body (i.e. the capture wasn't needed at all).

GCC 9 now enforces what the C++ standard said all along:  there is
no need to allow capture-default in this case, so it is not.

Fix by removing the offending capture-default.
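
For illustration, a sketch of the pattern with a hypothetical lambda named add_one. At namespace scope there are no local variables to capture, so the capture list is left empty.

```cpp
#include <cassert>

// A lambda at namespace scope.  Writing [&] or [=] here is what newer
// GCC rejects; an empty capture list is all that is needed.
static const auto add_one = [](int x) { return x + 1; };
```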

Fixes: https://github.com/Zygo/bees/issues/112
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:06 -04:00
Zygo Blaxell
87e8a21c41 fs: do not emulate extent-same by clone
It is not possible to emulate extent-same by clone in a safe way.
EXTENT_SAME has been supported in btrfs since kernel 3.13, which
is much too old to contemplate running bees on.

Remove this dangerous and unused function.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:06 -04:00
Zygo Blaxell
e3747320cf BtrfsExtentWalker: use a buffer at least as large as a btrfs metadata page to avoid EOVERFLOW
We are getting a lot of exceptions when an inline extent is too large for the
TREE_SEARCH_V2 buffer.  This disrupts ExtentWalker's extent boundary
search when there is an inline extent at the beginning of a file:

	# fiemap foo
	Log 0x0..0x1000 Phy 0x0..0x1000 Flags FIEMAP_EXTENT_NOT_ALIGNED|FIEMAP_EXTENT_DATA_INLINE
	Log 0x1000..0x2000 Phy 0x7307f9000..0x7307fa000 Flags 0
	Log 0x2000..0x3000 Phy 0x731078000..0x731079000 Flags 0
	Log 0x3000..0x5000 Phy 0x73127d000..0x73127f000 Flags FIEMAP_EXTENT_ENCODED
	Log 0x5000..0x6000 Phy 0x73137a000..0x73137b000 Flags 0
	Log 0x6000..0x7000 Phy 0x731683000..0x731684000 Flags 0
	Log 0x7000..0x8000 Phy 0x73224f000..0x732250000 Flags 0
	Log 0x8000..0x9000 Phy 0x7323c9000..0x7323ca000 Flags 0
	Log 0x9000..0xb000 Phy 0x732425000..0x732427000 Flags FIEMAP_EXTENT_ENCODED
	Log 0xb000..0xc000 Phy 0x732598000..0x732599000 Flags 0
	Log 0xc000..0xd000 Phy 0x7325d5000..0x7325d6000 Flags FIEMAP_EXTENT_LAST

	# fiewalk foo
	exception type std::system_error: BTRFS_IOC_TREE_SEARCH_V2: /tmp/foo at fs.cc:844: Value too large for defined data type

Normally crawlers simply skip over inline extents, but ExtentWalker will
seek backward from the first non-inline extent to confirm that it has
an accurate starting block for the target extent.  This fails when it
encounters the first inline extent.

strace reveals that buffer size is too small for the first extent,
as seen here:

	ioctl(3, BTRFS_IOC_TREE_SEARCH_V2, {key={tree_id=258, min_objectid=78897856, max_objectid=UINT64_MAX, min_offset=0, max_offset=UINT64_MAX, min_transid=0, max_transid=UINT64_MAX, min_type=BTRFS_EXTENT_DATA_KEY, max_type=BTRFS_EXTENT_DATA_KEY, nr_items=16}, buf_size=1360} => {buf_size=1418}) = -1 EOVERFLOW (Value too large for defined data type)

Fix this by increasing the buffer size until it can handle the largest
possible object on the largest possible btrfs metadata page (65536 bytes).
BtrfsExtentWalker already has optimizations to minimize the allocation
cost, so we don't need any changes there.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:05 -04:00
Zygo Blaxell
2c3d1822f7 bees: don't try to print si_lower and si_upper
Some build environments (ARM?  AARCH64?) do not have the fields
si_lower and si_upper in siginfo.

bees doesn't need them, so don't try to access them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:05 -04:00
Zygo Blaxell
b149528828 docs: tested build with btrfs-progs 4.20.2
Update the version ranges on the dependencies.

FIXME/TODO:  start dropping early versions that don't work with current
code?

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:05 -04:00
Zygo Blaxell
ce2521b407 docs: update btrfs feature interaction status for flushoncommit and SSD caching layers
flushoncommit or not-flushoncommit isn't really a bees matter--it's
a sysadmin's tradeoff between reliability and performance.  bees does
not affect that tradeoff because all dedupe src extents are flushed, so
bees introduces no *new* data loss risks in the noflushoncommit
case--i.e. any data that you could lose while running bees, you'd also
lose when not running bees.

Note that the converse is not true:  bees might trigger flushing on
data that would not normally have been flushed with noflushoncommit,
and improve data integrity after a crash as a side-effect of dedupe
operations.  The risks of noflushoncommit might be reduced by running
bees.  I don't have evidence based on experimental data to support that
conclusion, so I'll just leave this possibility as a rumor in a commit
log message.

lvmcache can be moved from the "bad" list to the "good" list now.

bcache remains in the "bad" list due to some non-data-losing failures
that only seem to happen with bcache.

Add a note about CPUs with strange endianness or page sizes, as nobody
seems to have tried those.

Remove "at great cost" from the btrfs send workaround.  The cost is
the cost, there is no need to editorialize.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:05 -04:00
Zygo Blaxell
17a75e61f8 README: highlight DATA CORRUPTION WARNING
The existence of information about known data corruption bugs should be
visible from the top-level page.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:48:05 -04:00
Zygo Blaxell
e1476260e1 docs: update kernel compatibility page, now recommending 5.0.4
 * comprehensive list of kernels with bees-triggered corruption bug fixes
 * deadlock between dedupe and rename is now fixed (in some places)
 * compressed data corruption is now fixed (in more places)
 * btrfs send fix for one bug is now merged in 5.2-rc1, another bug remains
 * retired the bcache/lvmcache bug (can't reproduce those bugs any more,
   although I *can* reproduce an interesting non-destructive bcache bug)
 * new minor bug entries for two harmless kernel warnings
 * new entry for storm-of-soft-lockups

Fixes: https://github.com/Zygo/bees/issues/107
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-06-12 22:47:57 -04:00
Zygo Blaxell
978c577412 status: report number of active worker threads in status output
This is especially useful when dynamic load management allocates more
worker threads than active tasks, so the extra threads are effectively
invisible.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-01-07 22:52:12 -05:00
Zygo Blaxell
7548d865a0 docs: event counter documentation
This may help users understand some of the things that happen inside
bees...or it may just be horribly long and confusing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-01-07 22:48:16 -05:00
Zygo Blaxell
4021dd42ca task: queue and run exactly once per instance
Enable much simpler Task management:  each time a Task needs to be done
at least once in the future, simply invoke the run() method on the Task.
The Task will ensure that it only runs once, only appears in a queue
once, and will run again if a run request is made while the Task is
already running.
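
The rule above can be sketched as a small state machine. This is a single-threaded illustration with hypothetical names (SketchTask, TaskQueue, drain); it has no locking, which a real implementation would need around the flags and the queue, and it does not handle exceptions thrown by the task function.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

class SketchTask;
using TaskQueue = std::deque<std::shared_ptr<SketchTask>>;

// Hypothetical sketch: no locking, single-threaded use only.
class SketchTask : public std::enable_shared_from_this<SketchTask> {
public:
    SketchTask(TaskQueue &queue, std::function<void()> fn) :
        m_queue(queue), m_fn(std::move(fn))
    {
    }

    // Request at least one future execution of this task.
    void run()
    {
        if (m_running) {
            // Already executing: remember to go around once more.
            m_run_again = true;
            return;
        }
        if (m_queued) {
            // Already waiting in the queue: nothing to do.
            return;
        }
        m_queued = true;
        m_queue.push_back(shared_from_this());
    }

    // Called by a worker after it pops the task from the queue.
    void exec()
    {
        m_queued = false;
        m_running = true;
        m_fn();
        m_running = false;
        const bool again = m_run_again;
        m_run_again = false;
        if (again) {
            run();
        }
    }

private:
    TaskQueue &m_queue;
    std::function<void()> m_fn;
    bool m_queued = false;
    bool m_running = false;
    bool m_run_again = false;
};

// Minimal worker loop: execute tasks until the queue is empty.
inline void drain(TaskQueue &queue)
{
    while (!queue.empty()) {
        auto task = queue.front();
        queue.pop_front();
        task->exec();
    }
}
```

Calling run() several times before the task executes leaves one queue entry; calling run() from inside the task function results in exactly one more execution.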

Make the queue policy a member of the Task rather than a method.  This
enables Tasks to reschedule themselves, possibly on the appropriate queue
if we have more than one of those some day.

This happens to make Tasks more similar to Linux kernel workers.
This similarity is coincidental, but not undesirable.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-01-07 22:48:15 -05:00
Zygo Blaxell
e1de933f93 docs: add some notes about interactions with balance
Prompted by discussion at https://github.com/Zygo/bees/issues/105

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-01-07 22:48:15 -05:00
Zygo Blaxell
f41fd73760 docs: add Gotcha for SIGTERM
This summarizes the discussion at:

	https://github.com/Zygo/bees/issues/100

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-01-06 01:54:57 -05:00
Zygo Blaxell
d583700962 docs: describe expected exceptions and impact of exception handling
Add some docs about the exceptions that are less easy to suppress
directly.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-01-06 01:54:57 -05:00
Zygo Blaxell
be2c55119e bees: make exceptions less prominent in log output
Introduce a mechanism to suppress exceptions which do not produce a
full stack trace for common known cases where a loop should be aborted.
Use this mechanism to suppress the infamous "FIXME" exception.

Reduce the log level to at most NOTICE, and in some cases DEBUG.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2019-01-06 01:48:35 -05:00
Zygo Blaxell
4a1971bce5 process: SIGUNUSED is deprecated
SIGUNUSED is not defined in many environments (it seems to be defined
in only one I've tried so far).  Hide the reference with #ifdef.
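
A sketch of the guard, using a hypothetical signal_name function and a deliberately tiny table; the real table and its layout are not shown here.

```cpp
#include <cassert>
#include <csignal>
#include <map>
#include <string>

// Hypothetical sketch of a signal-number-to-name lookup.
inline std::string signal_name(int sig)
{
    static const std::map<int, std::string> names = {
        { SIGINT, "SIGINT" },
        { SIGTERM, "SIGTERM" },
#ifdef SIGUNUSED
        // Only referenced where the platform headers define it.
        { SIGUNUSED, "SIGUNUSED" },
#endif
    };
    const auto it = names.find(sig);
    // Unknown signals fall back to their decimal number.
    return it != names.end() ? it->second : std::to_string(sig);
}
```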

Fixes: https://github.com/Zygo/bees/issues/94
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-13 18:03:35 -05:00
Zygo Blaxell
843f78c380 docs: bees can stop now
Remove the paragraph stating otherwise.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-10 19:56:08 -05:00
Zygo Blaxell
5f063dd752 docs: tested with GCC 6.3.0
Update the list of compiler versions tested.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 23:39:44 -05:00
Zygo Blaxell
7933ccb660 build: make libcrucible a static library
libcrucible at one time in the distant past had to be a shared library
to force global C++ object initialization; however, this is no longer
required.

Make libcrucible static to solve various rpath and soname versioning
issues, especially when distros try (unwisely) to package the library
separately.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 23:39:44 -05:00
Zygo Blaxell
f17cf084e6 hash: clean up comments, audit for bugs
We stopped supporting shared hash tables a long time ago.  Remove comments
describing the behavior of shared hash tables.

Add an event counter for pushing a hash to the front when it is already at
the front.

Audited the code for a bug related to bucket handling that impairs space
efficiency when the bucket size is greater than 1.  Didn't find one.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 23:39:44 -05:00
Zygo Blaxell
570b3f7de0 bees: handle SIGTERM and SIGINT, force immediate flush and exit
Capture SIGINT and SIGTERM and shut down, preserving current completed
crawl and hash table state.

  * Executing tasks are completed, queued tasks are paused.
  * Crawl state is saved.
  * The crawl master and crawl writeback threads are terminated.
  * The task queue is flushed.
  * Dirty hash table extents are flushed.
  * Hash prefetch and writeback threads are terminated.
  * Hash table is deallocated.
  * FD caches and tmpfiles are destroyed.
  * Assuming the above didn't crash or deadlock, bees exits.

The above order isn't the fastest, but it does roughly follow the
shared_ptr dependencies and avoids data races--especially those that
might lead to bees reporting an extent scanned when it was only queued
for future scanning that did not occur.

In case of a violation of expected shared_ptr dependency order,
exceptions in BeesContext child object accessor methods (i.e. roots(),
hash_table(), etc) prevent any further progress in threads that somehow
remain unexpectedly active.

Move some threads from main into BeesContext so they can be stopped
via BeesContext.  The main thread now runs a loop waiting for signals.

A slow FD leak was discovered in TempFile handling.  This has not been
fixed yet, but an implementation detail of the C++ runtime library makes
the leak so slow it may never be important enough to fix.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 23:39:44 -05:00
Zygo Blaxell
cbc6725f0f time: separate sleep time calculation from sleep_for method
We need to replace nanosleeps with condition variables so that we
can implement BeesContext::stop.  Export the time calculation from
sleep_for() into a new method called sleep_time().

If the thread executing RateLimiter::sleep_for() is interrupted, it will
no longer be able to restart, as the sleep_time() method is destructive.
This calls for further refactoring of sleep_time() into destructive
and non-destructive parts; however, there are currently no users of
sleep_for() which rely on being able to restart after being interrupted
by a signal.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 23:45:52 -05:00
Zygo Blaxell
0e42c75f5a process: ntoa function for signals
This enables signal numbers to be translated to names.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 23:45:52 -05:00
Zygo Blaxell
4e962172a7 task: add cancel method
Add a method to have TaskMaster discard any entries in its queue, terminate
all worker threads, and prevent any new Tasks from being queued.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 01:15:24 -05:00
Zygo Blaxell
389dd52cc1 tempfile: drop the fsync()
The deadlock seems to be fixed now (if there ever was one--there certainly
were deadlocks, but matching deadlocks to root causes is non-trivial
and a number of distinct deadlock cases have been fixed in recent years).

The benchmark data is inconclusive about whether it is better to fsync or
not to fsync.  A paranoia option might be useful here.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-12-09 01:00:36 -05:00
Zygo Blaxell
f4464c6896 roots: quick fix for task scheduling bug leading to loss of crawl_master
The crawl_master task had a simple atomic variable that was supposed
to prevent duplicate crawl_master tasks from ending up in the queue;
however, this had a race condition that could lead to m_task_running
being set with no crawl_master task running to clear it.  This would in
turn prevent crawl_thread from scheduling any further crawl_master tasks,
and bees would eventually stop doing any more work.

A proper fix is to modify the Task class and its friends such that
Task::run() guarantees that 1) at most one instance of a Task is ever
scheduled or running at any time, and 2) if a Task is scheduled while
an instance of the Task is running, the scheduling is deferred until
after the current instance completes.  This is part of a fairly large
planned change set, but it's not ready to push now.

So instead, unconditionally push a new crawl_master Task into the queue
on every poll, then silently and quickly exit if the queue is too full
or the supply of new extents is empty.  Drop the scheduling-related
members of BeesRoots as they will not be needed when the proper fix lands.

Fixes: 4f0bc78a "crawl: don't block a Task waiting for new transids"
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-25 23:46:55 -05:00
Zygo Blaxell
f051d96d51 docs: dash more useful than previously believed
It turns out both dash and bash support `command -v` so let's use that.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-25 23:21:52 -05:00
Zygo Blaxell
ba5fda1605 docs: use bash "type -p" because dash isn't useful
If /bin/sh is bash, the 'type' builtin produces a list of filenames
that match the arguments to $PATH.

If /bin/sh is dash, we get errors like:

	/bin/sh: 1: P:: not found

Hopefully having a build-dep on bash is not controversial.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-22 21:37:09 -05:00
Zygo Blaxell
6cf16c4849 docs: add instructions for Ubuntu 18.10
As described in https://github.com/Zygo/bees/issues/88

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-22 21:36:39 -05:00
Zygo Blaxell
5a80ce5cd6 README: reintroduce new btrfs-send-compatibility workaround
Now it appears in both the github.io and github.com feature lists.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-22 21:22:10 -05:00
Zygo Blaxell
012219bbfb docs: derive docs/index.md from README.md
The two files are identical except README.md links to docs/* while
index.md links to *.

A sed script can do that transformation, so use sed to do it.

This does modify a file in git, but this is necessary to make all
the Github views work consistently.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-22 21:21:29 -05:00
Zygo Blaxell
bf2a014607 roots: improve "RO root 6094" message
This sequence of log messages isn't clear:

	crawl_master: WORKAROUND: Avoiding RO subvol 6094
	crawl_master: WORKAROUND: RO root 6094

The first is from a cache miss, and appears wherever a root is opened
(dedupe or crawl).  The second is skipping an entire subvol scan, and
only happens in crawl_master.

Elaborate on the second message a little.

Also use the term "root" consistently when referring to subvol tree IDs.
btrfs refers to these objects by (at least) three distinct names:  tree,
subvol, and root.  Using three different words for the same thing is worse
than using a single wrong word consistently to refer to the same concept.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-22 21:10:15 -05:00
Zygo Blaxell
cdca2bcdcd main: single BeesContext instance per process
After weeks of testing I copied part of a change to main without copying
the rest of the change, leading to an immediate segfault on startup.

So here is the rest of the change:  limit the number of
BeesContexts per process to 1.  This change was discussed at
https://github.com/Zygo/bees/issues/54#issuecomment-360332529 but there
are more reasons to do it now:  the candidates to replace the current
hash table format are less forgiving of sharing hash tables, and it may
even become necessary to have more than one hash table per BeesContext
instance (e.g. to keep datasum and nodatasum data separate).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-22 20:40:30 -05:00
Zygo Blaxell
e0c8df6809 docs: working with btrfs send is kind of a feature
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-21 23:19:37 -05:00
Zygo Blaxell
34b04f4255 bees: soft-limit computed thread counts to 8
https://github.com/Zygo/bees/issues/91 describes problems encountered
when running bees on systems with many CPU cores.

Limit the computed number of threads (using --thread-factor or the
default) to a maximum of 8 (i.e. the number of logical cores in a modern
laptop).  Users can override the limit by using --thread-count.
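
A sketch of the selection rule, under assumptions: the function name, the rounding up, and the minimum of one thread are all hypothetical; only the limit of 8 and the override behaviour come from the text above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr unsigned SOFT_THREAD_LIMIT = 8;

// thread_count > 0 models an explicit --thread-count and bypasses the
// soft limit; otherwise the count is computed from cores * factor.
inline unsigned choose_thread_count(unsigned cores, double thread_factor,
                                    unsigned thread_count)
{
    if (thread_count > 0) {
        return thread_count;
    }
    // Rounding up and the floor of 1 are assumptions of this sketch.
    const double computed = std::ceil(cores * thread_factor);
    const unsigned n = computed < 1.0 ? 1u : static_cast<unsigned>(computed);
    return std::min(n, SOFT_THREAD_LIMIT);
}
```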

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-21 21:49:16 -05:00
Zygo Blaxell
d9c788d30a docs: reorganize options, add workaround for btrfs send
options.md was a disorganized mess that markdown couldn't parse properly.

Break the options list down into sections by theme.  Add the new
'--workaround-btrfs-send' option to the new 'Workarounds' section.

Clean up the rest of the text and fix some inconsistencies.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-21 21:49:16 -05:00
Zygo Blaxell
23f3e4ec42 workarounds: add workaround for btrfs send
Introduce --workaround options which trade performance or effectiveness to
avoid triggering kernel bugs.

The first such option is --workaround-btrfs-send, which avoids making any
modification to read-only subvols to avoid btrfs send bugs.

Clean up usage message:  no tabs for formatting, split options into
sections by theme.

Make scan mode a non-static data member like all (most?) other options.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-21 21:49:16 -05:00
Kai Krakow
6c68e81da7 Makefile: Fix git usage for non-git source archive
We didn't take enough care to fix all invocations of git in this
scenario.

Fixes: 32d2739 ("Makefile: Specify version when building from tarball")
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-11-18 16:10:32 +01:00
Zygo Blaxell
e74122b512 resolver: don't log hash collision incidents
The log message is quite CPU-intensive to generate, and some data sets
have enough hash collisions to throw off benchmarks.

Keep the event counter but drop the log message.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-16 17:20:49 -05:00
Zygo Blaxell
0d5c018c3c fs: if search fails, return empty result set
Make sure the result set is empty before running the ioctl in case
something tries to consume the result without checking the error status.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-16 17:20:49 -05:00
Zygo Blaxell
a676928ed5 fs: remove thread_local storage
If we are not zero-filling containers then the overhead of allocating them
on each use is negligible.  The effect that the thread_local containers
were having on RAM usage was very non-negligible.

Use dynamic containers (members or stack objects) for better control
of object lifetimes and much lower peak RAM usage.  They're a tiny bit
faster, too.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-08 23:55:13 -05:00
Zygo Blaxell
e3247d3471 stats: streamline add_count
Perf was blaming BeesStats::add_count for >1% of instructions.

Trim the instruction count a little.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-08 23:31:50 -05:00
Zygo Blaxell
19859b0a0d docs: toxic extents and btrfs send
Update documentation of toxic extent / slow backref workaround.

Add notes about btrfs send kernel bugs and incremental send failures.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-08 21:31:02 -05:00
Kai Krakow
688d0dc014 crucible: Try repairing a build failure around swap macro
Gentoo-Bug: https://bugs.gentoo.org/670606
Fixes: https://github.com/Zygo/bees/issues/85
Suggested-by: Zygo Blaxell <bees@furryterror.org>
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-11-08 19:29:11 +01:00
Kai Krakow
c69a954d8f Makefile: Bring back -O3 in a downstream-compatible way
This commit brings back -O3 but in an overridable way. This should make
downstream distributions happy enough to accept it.

While on the subject, let's apply the same fixup logic to LDFLAGS, too.

This commit also properly gets rid of the implicit rules which collided
too easily with the depends.mk.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-11-08 03:23:40 +01:00
Kai Krakow
f2dec480a6 Makefile: mkdir .depends only when needed
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-11-08 02:56:48 +01:00
Kai Krakow
d4535901a5 Makefile: Use the jobserver properly
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-11-08 02:52:04 +01:00
Zygo Blaxell
8cbd6fc67a fs: support LOGICAL_INO_V2
Automatically fall back to LOGICAL_INO if LOGICAL_INO_V2 fails and no
_V2 flags are used.

Add methods to set the flags argument with build portability to older
headers.

Use thread_local storage for the somewhat large buffers used by
LOGICAL_INO_V2 (and other users of BtrfsDataContainer like INO_PATHS).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-05 21:12:36 -05:00
Zygo Blaxell
c2762740ef context: remove limit on the number of references to an extent
Better toxic extent detection means we can now handle extents with
many more references--easily hundreds of thousands.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-11-05 21:12:11 -05:00
rsjaffe
8bec9624da systemd service replace deprecated parameters
Replace CPU shares and IO block weight with CPU weight and IO weight. Note that the new parameters are roughly 1/100 of the old ones--I believe that's the right conversion. Also removed the duplicate Nice parameter and alphabetized the parameters for ease of reading.
2018-11-05 12:35:17 -08:00
Zygo Blaxell
aa74a238b3 hash: remove preloaded toxic hash blacklist
Faster and more reliable toxic extent detection means we can now be much
less paranoid about creating toxic extents.

The paranoia has significant impact on dedupe hit rates because every
extent that contains even one toxic hash is abandoned.  The preloaded
toxic hashes were chosen because they occur more frequently than any
other block contents in typical filesystem data.  The combination of these
resulted in as much as 30% of duplicate extents being left untouched.

Remove the preloaded toxic extent blacklist, and rely on the new
kernel-CPU-usage-based workaround instead.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-31 23:03:01 -04:00
Zygo Blaxell
6e6b08ea0e scripts: put AL16M back to avoid breaking existing scripts
Leave AL16M defined in beesd to avoid breaking scripts based on
beesd.conf.sample which used this constant.

Use the absolute size in beesd.conf.sample to avoid any future problems.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-31 22:50:36 -04:00
Zygo Blaxell
542371684c context: better detection for toxic extents
We detect toxic extents by measuring how long the LOGICAL_INO ioctl takes
to run.  If it is above some threshold, we consider the extent toxic,
and blacklist it; otherwise, we process the extent normally.

The detector was using the execution time of the ioctl, which detects
toxic extents, but it also detects pauses of the bees process and
transaction commit latency due to load.  This leads to a significant
number of false positives.  The detection threshold was also very long,
burning a lot of kernel CPU before the detection was triggered.

Use the per-thread system CPU statistics to measure the kernel CPU usage
of the LOGICAL_INO call directly.  This is much more reliable because it
is not confounded by other threads, and it's faster because we can set
the time threshold two orders of magnitude lower.

Also remove the lock and mutex added in "context: serialize LOGICAL_INO
calls" because we theoretically no longer need it (but leave the code
there with #if 0 in case we do need it in practice).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-31 21:12:16 -04:00
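The measurement idea in 542371684c (charge only the kernel CPU time of the calling thread, rather than wall-clock time) might look like the sketch below. It assumes Linux, where `getrusage` accepts `RUSAGE_THREAD`; the helper names and the threshold parameter are hypothetical, and the log does not say which interface or threshold bees actually uses.

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/time.h>

// Returns system (kernel) CPU seconds consumed so far by the calling thread.
// RUSAGE_THREAD is Linux-specific.
double thread_system_cpu_seconds()
{
    struct rusage ru;
    const int rv = getrusage(RUSAGE_THREAD, &ru);
    assert(rv == 0);
    (void)rv;
    return ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

// Runs `call` and reports whether it used more kernel CPU than `threshold_seconds`.
// Both names are hypothetical; the real threshold is not shown in this log.
template <class Fn>
bool exceeds_kernel_cpu(Fn call, double threshold_seconds)
{
    const double before = thread_system_cpu_seconds();
    call();
    const double after = thread_system_cpu_seconds();
    return (after - before) > threshold_seconds;
}
```

Because the counter is per-thread, time spent blocked on a transaction commit or preempted by other threads does not count toward the threshold.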
Zygo Blaxell
9a97699dd9 roots: reimplement transid_max_nocache using extent tree root
ROOT_TREE contains the ROOT_ITEM for EXTENT_TREE.  Every modification
(that we care about) to a btrfs must go through EXTENT_TREE, and must
modify the page in ROOT_TREE pointing to the root of EXTENT_TREE...
which makes that a very good source for the filesystem transid.

Remove the loop and the root lookups, and just look at one item for
max_transid.

Also note that every caller of transid_max_nocache() immediately
feeds the return value to m_transid_re.update(), so don't do that
inside transid_max_nocache().

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-31 00:09:49 -04:00
Zygo Blaxell
0e8b591232 Revert "roots: simplify BeesRoots::transid_max_nocache"
It turns out that we do need to scan all the subvols in order
to find transid_max.

Keep the bug fix though.

This reverts commit bf6ae80eee.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 23:29:05 -04:00
Zygo Blaxell
bf6ae80eee roots: simplify BeesRoots::transid_max_nocache
BeesRoots::transid_max_nocache calls btrfs_get_root_transid() which
retrieves the transid of the root of the given Fd.  Since the FS_TREE
(subvol 5) is the root of the subvol hierarchy, it will always have
the highest transid on the filesystem, and we do not need to look at
any others.

Also fix a bug where we pass BTRFS_FS_TREE_OBJECTID instead of the
file descriptor root_fd() to btrfs_get_root_transid().  If BEESHOME
is somewhere on the same btrfs filesystem, and there are no leaked FDs
at bees startup, then BTRFS_FS_TREE_OBJECTID (5) usually has the same
integer value as a valid file descriptor of some object on the filesystem
that has a regularly increasing transid value.  If Fd 5 happens to be a
file in BEESHOME then bees itself drives the transid increments.  This,
combined with the search of all subvol roots, hides the bug (unless Fd
5 gets closed somehow).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 21:12:17 -04:00
Zygo Blaxell
1a51bb53bf context: cache result of home_fd()
BeesContext::home_fd() is supposed to open $BEESHOME once and cache
the Fd for later calls; however, instead it was reopening a new Fd each
time it was called, and _also_ holding that Fd in a BeesContext member.
Fds clean themselves up when they are forgotten, so it was not leaking
per se, but it certainly had more open Fds than it needed to.

Check to see if we have m_home_fd open, and return that if so.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 21:12:16 -04:00
Zygo Blaxell
35b21687bc bees: drop unused member m_uuid
There is a m_root_uuid which is used.  m_uuid is not, so drop it
and save a tiny amount of memory.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 21:12:16 -04:00
Zygo Blaxell
63ddbb9a4f context: serialize LOGICAL_INO calls
LOGICAL_INO can trip over the btrfs slow-backrefs bug, resulting in
some very long in-kernel runtimes.  If too many threads are executing
LOGICAL_INO then there may be no cores left on the system to run other
tasks.

Toxic extent detection is done by a very rudimentary algorithm which
can be confused by unrelated sources of latency within btrfs (especially
commit latency).  The algorithm can also be confused by other threads
executing the LOGICAL_INO ioctl.

These are two good reasons to prevent any two threads in a single bees
process instance from executing LOGICAL_INO at the same time, so let's
do that.

It is possible to limit the number of threads executing LOGICAL_INO with
the -c and -C options; however, this also limits the number of threads
which can perform any operation, while only LOGICAL_INO (*) has such a
profound effect on the rest of system operation.

Also make the status message clearer about exactly when LOGICAL_INO is
executed, as opposed to merely waiting to acquire a lock before executing
the ioctl.

(*) or maybe FILE_EXTENT_SAME.  The problem function that keeps showing
up in kernel stack traces is find_parent_nodes, which is called by both
the LOGICAL_INO and FILE_EXTENT_SAME ioctls.  We'll try this change
first and see if it prevents any recurrences of forced watchdog reboots;
if it does not, then we'll limit FILE_EXTENT_SAME the same way.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 21:12:16 -04:00
Zygo Blaxell
373b9ef038 roots: fix subvol scan rollover on subvols with empty transid range
The ordering function for BeesCrawlState did not consider

	root 292 inode 0 min_transid 2345 max_transid 3456

to be larger than

	root 292 inode 258 min_transid 2345 max_transid 2345

so when we attempted to update the end pointer for the crawl progress,
the new state was not considered newer than the old state because the
min_transid was equal, but the new crawl state's inode number was smaller.

Normally this is not a problem because subvol scans typically begin
and end in separate transactions (in part because we don't start a
subvol scan until at least two transactions are available); however,
the cleanup code for the aftermath of the recent transid_min() bug can
create crawlers with equal max_transid and min_transid records.

Fix this by ordering both transid fields before any others in the
crawl state.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 21:12:14 -04:00
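The ordering fix in 373b9ef038 can be illustrated with a small stand-in type. `CrawlState` below is a hypothetical struct that borrows its field names from the log output quoted in the commit message; it is not the real BeesCrawlState, which may carry other fields.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for the crawl state; field names follow the log output above.
struct CrawlState {
    uint64_t root;
    uint64_t inode;
    uint64_t min_transid;
    uint64_t max_transid;
};

// Compare both transid fields first, then the position within the subvol.
bool operator<(const CrawlState &a, const CrawlState &b)
{
    return std::tie(a.min_transid, a.max_transid, a.root, a.inode)
         < std::tie(b.min_transid, b.max_transid, b.root, b.inode);
}

// The two states from the commit message: the rollover state must sort last.
void check_rollover_example()
{
    const CrawlState old_state{292, 258, 2345, 2345};
    const CrawlState new_state{292, 0, 2345, 3456};
    assert(old_state < new_state);
}
```

With inode compared before max_transid, the same two states would sort the other way around, which is the failure the commit describes.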
Zygo Blaxell
866a35c7fb roots: do not accept 18446744073709551615 as max_transid in beescrawl.dat
Due to an earlier bug some beescrawl.dat files will contain uint64_t
max as max_transid.  This prevents any further scanning on the subvol
because there is no possibility of having a real transid (or any other
uint64_t number) larger than uint64_t max.

If we detect a bad transid in beescrawl.dat, log a warning, then use
some more plausible value:  either min_transid to repeat the previous
incremental crawl, or 0 to restart the subvol scan from the beginning.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 21:12:14 -04:00
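A minimal sketch of the recovery rule in 866a35c7fb follows. `sanitize_max_transid` is a hypothetical helper, and the choice between the two fallbacks (min_transid when it is itself plausible, otherwise 0) is an assumption of this sketch; the commit message does not say how bees picks between them, and the warning log is omitted.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: replace an impossible max_transid loaded from a state file.
// Falls back to min_transid when that value is itself plausible, else to 0.
uint64_t sanitize_max_transid(uint64_t min_transid, uint64_t max_transid)
{
    const uint64_t bad = std::numeric_limits<uint64_t>::max();
    if (max_transid != bad) {
        return max_transid;  // value is usable as-is
    }
    return (min_transid != bad) ? min_transid : 0;
}
```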
Zygo Blaxell
90132182fd roots: do not allow transid_min to be numeric_limits<uint64_t>::max()
On a few test machines max_transid on subvols is getting set to
18446744073709551615 (aka uint64_t max).

Prevent transid_min() from ever returning this value.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-30 21:12:14 -04:00
Zygo Blaxell
90f98250c2 hash: remove pointless copy
"saved" is used only during hash table correctness analysis, which is
normally not enabled at compile time, and requires source modification
to enable.

Remove the pointless copy and save a tiny bit of CPU.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-19 20:21:04 -04:00
Zygo Blaxell
0c714cd55c scripts: use multiples (not power) of 128K
Adjust the scripts for the new smaller hash table extent size.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-19 20:21:04 -04:00
Zygo Blaxell
924008603e hash: reduce hash table extent size to 128KB
The 16MB hash table extent size did not serve any useful defragmentation
or compression purpose, and for very small filesystems (under 100GB),
16MB is much larger than necessary.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-19 20:21:04 -04:00
Zygo Blaxell
c01f129eee src: add bees-version.new.c to .gitignore
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-19 20:21:04 -04:00
Zygo Blaxell
5a49870fc9 docs: add coredumpctl
systemd-coredumpctl collects core files for later analysis
with gdb.  It's a convenient thing if the keys you use to encrypt
/var/lib/systemd/coredump are the same as the keys you use to encrypt
the filesystem where you're running bees.

Add it to the documentation just before the hand-rolled version.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-19 20:21:04 -04:00
Zygo Blaxell
14b35e3426 docs: add "what to do when something goes wrong" page
Standard crash backtrace collection, plus $BEESSTATUS for the high-level
overview of what bees is doing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-04 20:54:08 -04:00
Zygo Blaxell
7bba096077 Merge remote-tracking branch 'nilninull/master'
2018-10-02 22:13:55 -04:00
nilninull
aa324de9ed FIX: The systemd service file is always installed
2018-10-03 10:19:43 +09:00
Zygo Blaxell
e8298570ed README: split into sections, reformat for github.io
Split the rather large README into smaller sections with a pitch and
a ToC at the top.

Move the sections into docs/ so that Github Pages can read them.

'make doc' produces a local HTML tree.

Update the kernel bugs and gotchas list.

Add some information that has been accumulating in Github comments.

Remove information about bugs in kernels earlier than 4.14.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-10-02 03:41:31 -04:00
Kai Krakow
32d2739b0d Makefile: Specify version when building from tarball
When package maintainers build from a tarball, the .git directory does
not exist to extract the version tag. Let's add a hack to work around
this issue and let them specify `BEES_VERSION="v0.y"` on the make
cmdline.

Github-Bug: https://github.com/Zygo/bees/issues/75
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-30 04:20:26 +02:00
Kai Krakow
faf11b1c0c Update references to Gentoo
Gentoo has officially merged the ebuild into portage as of:
https://github.com/gentoo/gentoo/pull/9925

Let's update the readme and get rid of the `contrib/gentoo-bees`
directory, so we have no potentially outdated information in the future.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-29 22:26:56 +02:00
Kai Krakow
3504439d5c contrib/gentoo: Update ebuild
Now that the packaging preparations were merged, we should update the
ebuild to reflect the upstream master branch.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-27 10:55:24 +02:00
Zygo Blaxell
d4b3836493 extentwalker: don't fetch absurd numbers of extents just to throw them away
ExtentWalker doesn't gain significant benefits from caching, and the
extra SEARCH_V2 ioctls were blamed for a 33% kernel CPU overhead by perf.

Reduce the number of extents to 16 in lieu of fixing the caching.

This gives a significant speed boost on CPU-bound workloads compared
to the original 1024--almost 40% faster on a single SSD with a filesystem
consisting of raw VM images mounted with compress=zstd.

This also seems to reduce LOGICAL_INO overhead.  Perhaps SEARCH_V2 and
LOGICAL_INO were trying to lock the same extents, and interfering with
each other?

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-26 23:29:56 -04:00
Kai Krakow
f053e0e1a7 beesd: Fix the wrapper not finding any config file
`grep -q something | grep -q something_else` will never find anything.
The for-loop is redundant anyways because `grep -l` can already work for
us. Let's replace this with a shorter and working version.

CC: Timofey Titovets <timofey.titovets@synesis.ru>
(fixes: commit 06d41fd "Rewrite beesd arg parser")
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-16 17:56:31 -04:00
Zygo Blaxell
bcfc3cf08b Merge https://github.com/Zygo/bees/pull/62
2018-09-15 00:09:46 -04:00
Zygo Blaxell
9dbe2d6fee bees: add -G/--thread-min option for minimum thread count
The -g option limits the number of worker threads when the target load
average is exceeded.  On some systems the load normally runs high, and
continuous bees operation is required to avoid running out of disk space.

Add a -G/--thread-min option to force at least some threads to continue
running.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:50:07 -04:00
Zygo Blaxell
dd3c32a43d README: spell 'available' correctly
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:50:07 -04:00
Zygo Blaxell
3d536ea6df roots: if queue is full run again
The task queue may already be full of tasks when the crawl task is
executed.  In this case simply reschedule the crawl task at the
end of the current queue.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:50:06 -04:00
Zygo Blaxell
e66086516f bees: dynamic thread pool size based on system load average
Add -g / --loadavg-target parameter to track system load and add or
remove bees worker threads dynamically to keep system load close to the
loadavg target.  Thread count may vary from zero to the maximum
specified by -c or -C, and is adjusted every 5 seconds.

This is better than implementing a similar load average scheme from
outside of the process (though that is still possible) because the
in-process load tracker does not disrupt the performance timing feedback
mechanisms as a freezer cgroup or SIGSTOP would when controlling bees
from outside.  The internal load average tracker can also adjust the
number of active threads while an external tracker can only choose from
the maximum or zero.

Also fix a bug where a Task could deadlock waiting for itself to exit
if it tries to insert a new Task after the number of worker threads has
been set to zero.

Also correct usage message for --scan-mode (values are 0..2) since
we are touching adjacent lines anyway.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:50:03 -04:00
Zygo Blaxell
96eb100ded bees: use readahead instead of posix_fadvise
Other btrfs utils use readahead() not posix_fadvise().

There does not appear to be a performance or correctness difference
between the three (none, posix_fadvise, or readahead()).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:50:00 -04:00
Zygo Blaxell
041ad717a5 bees: configurable log verbosity
Log messages were already labelled with log levels, but there was no
way to filter by log level at run time.

Implement the filter inside the bees process so it can skip evaluation
of the BEESLOG* arguments if the log messages would not be emitted.

Fixes: https://github.com/Zygo/bees/issues/67

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:50:00 -04:00
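The point of filtering inside the process (041ad717a5) is that the arguments of a suppressed message are never evaluated. The sketch below shows one common way to get that effect with a macro. All names are hypothetical and are not the BEESLOG* macros; the convention that lower numbers are more severe is an assumption borrowed from syslog.

```cpp
#include <sstream>

// Hypothetical names; the real BEESLOG* macros are not reproduced here.
static int g_log_level = 6;           // messages with level > g_log_level are dropped
static std::ostringstream g_log_out;  // stands in for the real log sink

// The stream expression is evaluated only when the level passes the filter.
#define SKETCH_LOG(level, expr) \
    do { \
        if ((level) <= g_log_level) { \
            g_log_out << expr << '\n'; \
        } \
    } while (0)

static int g_expensive_calls = 0;

// Counts calls so the short-circuit is observable.
int expensive_value()
{
    ++g_expensive_calls;
    return 42;
}
```

A call such as `SKETCH_LOG(7, "debug " << expensive_value())` with `g_log_level` set to 3 never calls `expensive_value()`, which a filter applied outside the process could not achieve.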
Zygo Blaxell
b22db12390 context: log dedups with single unbroken log message
When BEESLOGINFO is called multiple times it generates separate log
records that can be mixed up when multiple threads dedup.

Use a single BEESLOGINFO call for each dedup to prevent this.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:50:00 -04:00
Zygo Blaxell
8938caa029 README.md: update build-deps
btrfs/ioctl.h has been moved to a different package.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:49:57 -04:00
Zygo Blaxell
8bc4bee8a3 crucible: progress: drop the set() method
set() was broken and redundant.  Calling hold() and discarding the
returned object has the correct effect.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:49:54 -04:00
Zygo Blaxell
1beb61fb78 crucible: error: record location of exception in what() message
Make the log show where the exception is thrown from.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-09-14 23:49:51 -04:00
Timofey Titovets
06d41fd518 Rewrite beesd arg parser
Signed-off-by: Timofey Titovets <timofey.titovets@synesis.ru>
2018-09-15 00:21:06 +03:00
Kai Krakow
788774731b Gentoo: Rework Gentoo ebuild into overlay
This commit squashes all the little changes from the previous
integration branch into one, adjusts to the new Makefile changes, and
introduces an overlay layout so that the contrib/gentoo-bees subtree
can be directly added as a Portage overlay to the system.

The following list contains the previous commit descriptions:

sys-fs/bees: Keyword tested architecture ~amd64

    Bees was tested on this platform.

sys-fs/bees: Add kernel version checks

    Add checking the kernel versions and write some info and/or warnings
    before building and installing the package. Running bees on older
    kernels may have some serious performance and stability impacts, so
    let's tell the user about it.

    Closes #55

sys-fs/bees: Add metadata.xml

sys-fs/bees: There's no configure script

    So, there's no point in calling "default".

sys-fs/bees: Simplify src_configure()

sys-fs/bees: Don't depend on markdown

    It makes no sense to install both README.md and README.html, and we can
    get rid of one dependency.

Dependencies: btrfs-progs is no longer a buildtime-only dep

    It is actually needed by the bees service wrapper script, as pointed out
    by Gentoo QA review.

sys-fs/bees: DOCS is not needed

    "COPYING" is already covered by the licensing. The ebuild defaults
    already include README*

sys-fs/bees: Make warnings exclusive

    It was recommended by Gentoo QA to show only either one or another
    warning, and change the texts accordingly.

sys-fs/bees: RDEPEND is not implicit

    RDEPEND does not implicitly default to DEPEND. Let's explicitly set the
    variable.

sys-fs/bees: IUSE=test is only needed for explicit dependencies

    Thus, remove it.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 05:06:39 +02:00
Kai Krakow
679a327ac5 Makefile: Do not force optimizations by default
Make life easier for package maintainers by not forcing architecture or
compiler optimizations by default. E.g., Gentoo QA refuses to accept
both "-march=native" and "-O3". These are usually provided by the
package tooling.

Instead, we provide easily accessible templates in "makeflags".

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 04:05:15 +02:00
Kai Krakow
31b41bb3c2 Makefile: Do not force making README.html
This forces us to depend on markdown which would be otherwise optional.
Most of the time it is sufficient to let package managers just install
the README.md file.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 03:34:48 +02:00
Kai Krakow
d7e235c178 Makefile: "which" is not portable
It was pointed out by Gentoo QA that "type -P" is a better choice.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 03:14:18 +02:00
Kai Krakow
51108f839d Makefile: Due to VPATH, libcrucible links to hard-coded libuuid path
Due to VPATH and how make resolves source paths, libcrucible.so ends up
with a hard-coded path to link against libuuid.so. Let's fix it by
turning the general rule into an explicit rule for libcrucible.so.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 03:07:20 +02:00
Kai Krakow
8d102abf8b Makefile: create a template compiler
This creates a simple template compiler using sed as a reusable
variable.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:59:54 +02:00
Kai Krakow
83e8f87dc9 Scripts: Don't prefix timestamps when running with systemd
Since systemd prefixes its own timestamps, we can unconditionally remove
timestamps when bees is executed by systemd.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:59:54 +02:00
Kai Krakow
4417b18d9e Makefile: .version.o is made from a generated file
We should probably not put it into the objects list. Let's instead
explicitly put it as a depend of libcrucible.so.

This allows us to not use *.cc as a depend for .version.cc which makes
more sense as CRUCIBLE_OBJS is also explicitly defined and not built
from wildcards.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:59:54 +02:00
Kai Krakow
8636312cab Compilation: Let the code know about package config
This commit adds support for putting package configuration options into
header files. This is needed to prepare reading config files from /etc.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:59:54 +02:00
Kai Krakow
17e1171464 Installation: Remove USR_PREFIX from Makefile
This commit removes USR_PREFIX and introduces ETC_PREFIX instead. The
purpose of PREFIX is the installation prefix in the system, not the
installation destination. The latter one is what DESTDIR is used for.

This should clear up the confusion. PREFIX was already mis-used as
installation destination. But that doesn't mix well with how the make
targets are designed.

CC: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:59:52 +02:00
Kai Krakow
9069201036 Scripts: Fix systemd unit not being templated
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:21:08 +02:00
Kai Krakow
ace814321f Makefile: Auto-detect systemd unit path
This uses pkg-config to detect the system unit dir.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:21:08 +02:00
Kai Krakow
451f0ad9aa Makefile: Allow installation of fiemap/fiewalk support tools
There's now a new make target called "install_tools" which would not run
by default on installation.

One can add "OPTIONAL_INSTALL_TARGETS=install_tools" into localconf to
install these by default.

fiewalk is installed to sbin because only root can run it; fiemap goes
to bin.

Gentoo can use this to optionally install these tools as a package
feature.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:20:59 +02:00
Kai Krakow
85f9265034 Makefile: make installing libs a separate target
This will allow installing fiemap/fiewalk support tools as an optional
install target.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:13:27 +02:00
Kai Krakow
5b28aad27f Makefile: Run install tests only for default target "reallyall"
Otherwise, tests would still run during "make install".

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:13:27 +02:00
Kai Krakow
6c47bb61c1 Makefile: remove tests from "make all"
Instead, introduce "make reallyall" and make it the default target. Now,
one can override the default target using localconf.

Needed for preparing Gentoo ebuild test behavior.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-09-08 02:13:27 +02:00
Timofey Titovets
2d14fd90e4 Update options in sample config
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2018-08-29 11:44:25 +03:00
Timofey Titovets
e0f315d47a Make beesd -h useful
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2018-08-29 11:44:25 +03:00
Zygo Blaxell
e564d27dda README: update known bugs and issues list
Also split "bad feature interactions" into "unknown" (which is what it
really was before) and "bad" (which includes some filesystem-destroying
problems).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-05-18 00:16:09 -04:00
Zygo Blaxell
c3effe0a20 crawl: use custom order instead of (ab)using BeesFileRange::operator<
This makes the code clearer and keeps changes to BeesFileRange ordering
isolated.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-05-18 00:16:08 -04:00
Zygo Blaxell
f8c27f5c6a bees: revert TOXIC_INTERVAL back to pre-4.14 levels
Linux kernel 4.14, while resistant to extent toxicity, is not immune to it.

Go back to the paranoid setting to avoid tying up filesystems in
ridiculously long kernel loops in find_parent_nodes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-05-18 00:16:08 -04:00
Zygo Blaxell
26039cd559 tempfile: update comments around bees_sync
Deadlock reproduced on kernel 4.14.34.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-05-18 00:16:04 -04:00
Zygo Blaxell
e9aef89293 fs: fix FTBFS on GCC 8
The memset is just doing an assignment from one dereferenced pointer to
another, so do an assignment to keep GCC 8 happy.

Fixes: https://github.com/Zygo/bees/issues/64

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-05-18 00:15:37 -04:00
Zygo Blaxell
c21518d8ff stats: rename "chase_wrong_data" to "chase_no_data"
An empty BeesBlockData from the chasing algorithm used to mean that data
was found at the expected location but it does not match; however, there
are now other reasons for this and they occur much more often.  The name
is misleading.

Change the name to report more correctly what happens:  no data, without
any guess about the reason.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-03-01 00:01:13 -05:00
Zygo Blaxell
082f04818f BeesBlockData: fix data type issues
Not sure if these cause any problems, but they are theoretically
incorrect data types.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-28 23:58:28 -05:00
Zygo Blaxell
5bdad7fc93 crucible: progress: a progress tracker for worker queues
The task queue can become very large with many subvols, requiring hours
for the queue to clear.  Any 'beescrawl.dat' saves in the meantime record
the work currently scheduled, not the work currently completed.

Fix by tracking progress with ProgressTracker.  ProgressTracker::begin()
gives the last completed crawl position.  ProgressTracker::end() gives
the last scheduled crawl position.  begin() does not advance if any item
between begin() and end() is not yet completed.  In between
are crawled extents that are on the task queue but not yet processed.
The file 'beescrawl.dat' saves the begin() position while the extent
scanning task queue is fed from the end() position.

Also remove an unused method crawl_state_get() and repurpose the
operator<(BeesCrawlState) that nobody was using.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-28 23:49:39 -05:00
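The begin()/end() behaviour described in 5bdad7fc93 can be sketched with a much simpler, single-threaded class. `SimpleProgress` and its `schedule`/`complete` methods are hypothetical; the real ProgressTracker in crucible has a different interface (the log mentions hold()) and has to be thread-safe.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical single-threaded sketch, not crucible's ProgressTracker.
class SimpleProgress {
    std::multiset<uint64_t> m_in_flight;  // scheduled but not yet completed
    uint64_t m_begin = 0;                 // everything strictly below this is complete
    uint64_t m_end = 0;                   // last scheduled position
public:
    // Record that work at `pos` has been queued. Positions must not go backwards.
    void schedule(uint64_t pos)
    {
        assert(pos >= m_end);
        m_in_flight.insert(pos);
        m_end = pos;
    }

    // Record that work at `pos` has finished, in any order.
    void complete(uint64_t pos)
    {
        const auto it = m_in_flight.find(pos);
        assert(it != m_in_flight.end());
        m_in_flight.erase(it);
        // begin() stops at the oldest item still outstanding.
        m_begin = m_in_flight.empty() ? m_end : *m_in_flight.begin();
    }

    uint64_t begin() const { return m_begin; }
    uint64_t end() const { return m_end; }
};
```

Saving begin() rather than end() means a restart may repeat some finished work, but it never skips work that was queued and not yet processed.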
Zygo Blaxell
90c32c3f05 crucible: MAP_32BIT is not defined on ARM
Also fix a stray #if that should be #ifdef.

Closes:  https://github.com/Zygo/bees/issues/59

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-25 10:08:44 -05:00
Zygo Blaxell
33d274eabd resolve: break up long intra-extent dedup loops
When both block candidates for dedup are located in the same extent, bees
excludes them from deduplication because the dedup operation would not
free any space (both blocks are still referenced, so neither is deleted).
Candidates in other extents are still considered.

Typically a few blocks are duplicated many thousands or even millions
of times within a filesystem.  Many of these blocks appear in the same
extent as each other.  In cases where an extent contains an extremely
common duplicate block, it may appear multiple times in many extents.
bees can get into a loop with a very bad worst-case running time:  32768
blocks per extent * 2560 bees reference limit * 256 distinct hash table
entries = 21.5 *billion* iterations...squared, because this loop happens
every time bees encounters any of the references.  Not an infinite
number, but close enough.
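
The worst-case figure above is just the product of the three quoted
factors.  A compile-time check of the arithmetic (the function name is
hypothetical and used only for illustration):

```cpp
#include <cstdint>

// Worst-case iteration count for one pass of the loop, using the three
// factors quoted in the commit message.
constexpr uint64_t worst_case_iterations(uint64_t blocks_per_extent,
                                         uint64_t reference_limit,
                                         uint64_t hash_entries) {
	return blocks_per_extent * reference_limit * hash_entries;
}

// 32768 * 2560 * 256 = 21,474,836,480, i.e. roughly 21.5 billion.
static_assert(worst_case_iterations(32768, 2560, 256) == 21474836480ULL,
              "matches the figure in the commit message");
```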

In each iteration of the loop, replace_dst detects that both src and dst
block are part of the same btrfs extent data item and therefore should
not be deduped; however, this occurs after the block has been allocated
and read by chase_extent_ref.  This dst is discarded, but the outer
loop tries again with another reference to the same block and gets the
same result.

An easy fix for this problem is to stop the loop immediately when the
same physical extent is found in both src and dst.  The condition is rare
enough to ignore the negligible space efficiency loss, and filesystem
scan stops dead if the loop is allowed to proceed.  An exception is
thrown to terminate the loop at scan_one_extent from within replace_dst.

It would be better to determine the extent bytenr of each candidate
extent and filter them out in scan_one_extent (which reduces the number
of LOGICAL_INO calls as a side-effect), but bees has no code capable of
doing extent data tree lookups with backward iteration yet.  Even better
would be to change the hash table format so that the extent bytenr can
be decoded directly from the hash table entry (this already exists for
compressed extents).  Both of these changes are too large for v0.6.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-25 10:08:42 -05:00
Zygo Blaxell
2ac94438bd README: FD caches are now cleared every 10 transactions
Also some other minor editorial changes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-14 21:09:05 -05:00
Zygo Blaxell
9063c6442f README: clarify that bees is not to be used on old kernels
Also note that there is currently no released Linux kernel that is free
of relevant bugs.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-14 20:54:48 -05:00
Zygo Blaxell
86afa69cd1 cache: release lock before clearing
Clearing the FD cache could trigger a lot of inode evicts in the kernel,
which will block the cache entry destructors called by map::clear().
This prevents any cache lookups or new file opens while it happens.

Move the map to an auto variable and destroy it after releasing the
mutex lock.  This probably has the same net result (all the bees threads
will be blocked in the kernel instead of on a bees mutex), but at least
the problem is outside of userspace now.
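
A minimal sketch of the "move the map to an auto variable" pattern,
assuming a plain std::map guarded by a std::mutex.  The class name
FdCacheSketch and its members are hypothetical; the real cache class
differs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical cache type for illustration; not the crucible cache class.
class FdCacheSketch {
	std::mutex m_mutex;
	std::map<int, std::shared_ptr<std::string>> m_map;
public:
	void insert(int key, std::string value) {
		std::unique_lock<std::mutex> lock(m_mutex);
		m_map[key] = std::make_shared<std::string>(std::move(value));
	}
	std::size_t size() {
		std::unique_lock<std::mutex> lock(m_mutex);
		return m_map.size();
	}
	void clear() {
		// Declared outside the locked scope so it outlives the lock.
		std::map<int, std::shared_ptr<std::string>> doomed;
		{
			std::unique_lock<std::mutex> lock(m_mutex);
			doomed.swap(m_map);  // cheap: no entry destructors run here
			assert(m_map.empty());
		}
		// `doomed` is destroyed on return, after the mutex is released,
		// so slow entry destructors no longer block lookups or inserts.
	}
};
```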

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-07 23:14:38 -05:00
Zygo Blaxell
8f0e88433e roots: get rid of common error messages, add more error counters
One very common case is losing a race to open a file that was deleted.
No need to spam the logs with mere ENOENT reports.

Other errors are more significant.  Log those with errno, and
add event counters to record them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-07 23:12:01 -05:00
Zygo Blaxell
5c1b45d67c extentwalker: remove wrong constraint check
Extents that extend past EOF will have ipos = (file size rounded up
to next block) and e.end() = (file size not rounded), which fails this
constraint check.

The constraint check is wrong.  Remove it for now.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-02-07 00:07:57 -05:00
Zygo Blaxell
6aad124241 crawl: somebody should set max_transid
The previous commit had both max_transid assignments commented out.
It happens to work because we set max_transid in the constructor and
it doesn't change after that, but it's cleaner to assign it explicitly.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-31 22:52:12 -05:00
Zygo Blaxell
087ec26c44 crawl: filter extents correctly
When an extent ref is modified, all of the refs in the same metadata
page get the same transid in the TREE_SEARCH_V2 header.  This causes
two problems:

	- Extents with generation < min_transid are included if they
	happen to be referenced by pages with generation >= min_transid.

	- Extent refs with generation > max_transid are excluded even
	if they reference extents with generation <= max_transid.

Both of these are wrong:  the first causes some extents to be repeatedly
scanned, the second causes some extents to not be scanned at all.

Change the TREE_SEARCH_V2 parameters so that Crawl sees all extents
newer than min_transid (i.e. set max_transid to max).  The TREE_SEARCH_V2
kernel logic already operates this way, i.e. it fetches every page with
transid >= min_transid and discards newer items if they are too new for
max_transid.  Filter strictly by the extent reference generation field
(i.e. the copy of the extent generation that is in the extent reference).

Note this still scans extent data multiple times, but it should now
be exactly once per extent reference.  A proper fix for this requires
extent-based scanning instead of extent-ref-based scanning.

Formerly commit 5a8c655fc4 "roots: filter
out obsolete extents from extent refs" which landed in the subvol-threads
branch but not master.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-31 22:48:39 -05:00
Kai Krakow
408b6ae138 Code style: Fix wrong indentation
This had spaces instead of tabs by accident.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-29 21:37:40 -05:00
Kai Krakow
e3c4a07216 Makefile: Unclutter "make test" output
This adds a .txt Makefile target to create a text file which receives
the test program output. In case the test failed, it will cat the
contents and fail the target.

Execution of each test itself is forced, so it would run every time make
is invoked, thus no failing test would be missed.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-29 21:37:40 -05:00
Kai Krakow
d8241a7720 README: Add notes about packaging
Give some pointers on how to package bees for a distribution.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-29 21:37:40 -05:00
Kai Krakow
5590fc0b13 Cmdline: Fix text alignment
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-29 21:37:40 -05:00
Kai Krakow
29d40ca359 Cmdline: Rename "relative-paths" to "strip-paths"
The previous name didn't match what this option really does.

Affects: #41

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-29 21:37:40 -05:00
Kai Krakow
b164717a25 Cmdline: Rename "notimestamps" to "no-timestamps"
That aligns better with the other options.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-29 21:37:40 -05:00
Zygo Blaxell
af250f7732 roots: determine transid_max without open()ing every subvol root
Scan the roots tree directly for roots other than 5 (the FS root), and
use btrfs_get_root_transid on root_fd for root 5.  This avoids filling
up the root FD cache every time we want a new transid_max.  Now the only
reason we open a subvol root FD is to open a file within the subvol.

transid_max may be the same as the FS root's transid, in which case
the search loop is not necessary.  Place a counter (transid_max_miss)
to see if we ever need to look at root items. If this counter never goes
above zero, or does so very rarely, we can delete the search loop.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 21:37:39 -05:00
Zygo Blaxell
4f0bc78a4c crawl: don't block a Task waiting for new transids
Task should not block for extended periods of time.

Remove the RateEstimator::wait_for() in crawl_roots.  When crawl_roots
runs out of data, let the last crawl_task end without rescheduling.
Schedule crawl_task again on transid polls if it was not already running.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 21:37:39 -05:00
Zygo Blaxell
b67fba0acd log: BEESLOGNOTE doesn't do what we think it does
BEESLOGNOTE was intended to combine BEESLOG and BEESNOTE, i.e. write a
log message and set the task status message from a single expression.
With the log levels we would now need several more variants
(BEESLOGNOTEDEBUG, BEESLOGNOTEERR...) or a parameter (BEESNOTELOG(DEBUG,
...)).

Or we give up on the idea.  This combination was used only 3 times so far.
The log messages and the note message have different editorial styles.

Remove the three instances of BEESLOGNOTE, and make the BEESLOGNOTE
definition equivalent to BEESLOG at LOG_NOTICE level for consistency.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 21:37:38 -05:00
Zygo Blaxell
92fda34a68 task: allow user access to ID and default constructor
The default constructor makes it more convenient to use Task as a
class member.

The ID is useful to disambiguate Task references.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 00:54:06 -05:00
Zygo Blaxell
2aacdcd95f time: add update_monotonic to RateEstimator
update_monotonic does not reset the counter if a new count is smaller than
earlier counts.  Useful when consuming an unsorted stream of event counts.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 00:51:13 -05:00
Zygo Blaxell
d367c6364c context: improve toxic match logs
Reword log message for discovery of new toxic extents vs. lookup of
previously known toxic extents.  Also add the block data (especially
filename) to the discovery message.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 00:48:06 -05:00
Zygo Blaxell
591a44e59a resolve: drop support for old-style compressed BeesAddr
No public version of bees ever created old-style compressed hash table
entries.  Remove the code that supports them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 00:48:06 -05:00
Zygo Blaxell
27125b8140 README: add scan-mode 2 and expand descriptions of modes 0 and 1
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 00:48:06 -05:00
Zygo Blaxell
636328fdc2 roots: add scan-mode 2 "oldest crawler first"
Add a third scan mode with alternative trade-offs.

Benefits:  Good sequential read performance.  Avoids race conditions
described in https://github.com/Zygo/bees/issues/27.  Avoids diverting
scan resources into short-lived snapshots before their long-lived
origin subvols are fully scanned.

Drawbacks:  Takes the longest time of the three implemented scan-modes
to free space in extents that are shared between snapshots.  Uses the
maximum amount of temporary space.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-29 00:48:05 -05:00
Zygo Blaxell
ef44947145 roots: move common code for creating crawl Tasks into a method
Duplicated code between the different scan modes has slowly been
becoming less and less trivial.  Move the code to a method and
make both scan-modes call it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-28 22:52:17 -05:00
Zygo Blaxell
72cc9c2b60 ExtentWalker: increase efficiency for typical btrfs extent sizes
Perf was blaming more than 50% of cycles on TREE_SEARCH_V2.  strace
showed 4 TREE_SEARCH_V2 calls for every pread in grow_backward().

Fix by increasing the extent fetch batch size so it is more likely
to include the desired items in the first fetch attempt.

This removes TREE_SEARCH_V2 from the top 10 list of cycle consumers.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-28 22:52:07 -05:00
Zygo Blaxell
e74c0a9d80 scan: fix length mismatch exception for prealloc extents at EOF
Prealloc extent sizes were taken from the Extent object and did not
take the file size into account.  If a file with a non-4K-aligned
size is preallocated, the resulting dedup fails with an exception
because the size of both ranges of the BeesRangePair do not match.

Limit the size of the replacement hole extent to not extend past the
end of the file.
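
The clamp can be sketched as below.  The helper name and its parameters
are hypothetical and only illustrate the idea of limiting the hole length
to the file size; they are not taken from the bees source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: length of the hole that replaces a prealloc extent
// spanning [extent_begin, extent_end), clamped so it does not pass EOF.
inline uint64_t hole_length_within_file(uint64_t extent_begin,
                                        uint64_t extent_end,
                                        uint64_t file_size) {
	assert(extent_begin <= extent_end);
	const uint64_t clamped_end = std::min(extent_end, file_size);
	// An extent that starts at or beyond EOF contributes nothing.
	return clamped_end > extent_begin ? clamped_end - extent_begin : 0;
}
```

For example, a 5000-byte file with an 8192-byte prealloc extent at offset 0
yields a 5000-byte hole, so both ranges in the pair have equal length.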

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-28 01:46:08 -05:00
Zygo Blaxell
762f833ab0 roots: poll every 10 transids
Restarting scans for each transid is a bit aggressive.  Scan every 10
transids for a polling rate close to the former BEES_COMMIT_INTERVAL.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
48e78bbe82 roots: use RateEstimator as a transid_max cache and clean up logs
transid_max is now measured at a single point in the crawl_transid thread.

Move the Crawl deferred logic into BeesRoots so it restarts all crawls
when transid_max increases.  Gets rid of some messy time arithmetic.

Change name of Crawl thread to "crawl_master" in both thread name and
log messages.

Replace "Next transid" with "Crawl started".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
ded26ff044 FdCache: clear cache on every new transid / crawl cycle
The periodic cache age check was not protected by a lock, so multiple
threads may decide to concurrently clear the cache.  This led to
duplicate log messages.

Fix by moving the cache expiry trigger out of FdCache and into Roots,
which knows when transids change and can perform cache clears at exactly
the time they are most relevant, i.e. after something that was deleted
becomes permanently so.

This removes the last references to BEES_COMMIT_INTERVAL, so get rid
of its definition too.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
72857e84c0 crawl: combine two messages per crawl cycle into one
Now that the polling interval is up to 30 times faster,
next_transid seems too verbose again.

Make it clearer that the interval quoted in the "Deferring..."
message is the computed transaction polling interval.

Combine "Next transid" and "Restarted crawl" into a single message.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
0fdae37962 roots: use RateEstimator to track transids
Make the crawl polling interval more closely track the commit interval
on the btrfs filesystem.  In the future this will provide opportunities
to do things like clear FD caches and stop crawls on deleted subvols,
but triggered by transaction commits instead of arbitrary time intervals.

Rename the "crawl" thread so it no longer has the same name as the "crawl"
task, and repurpose it for dedicated transid polling.  Cancel the deletion
of crawl_thread and repurpose it to trigger new crawls and wake up the
main crawl Task when it runs out of data.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
4694c7d250 time: add RateEstimator, a class for optimally polling irregular external events
RateEstimator estimates the rate of external events by sampling a
counter.

Conversion functions are provided to predict the time when the
event counter will be incremented to particular values based on past
observations of the event counter.

Synchronization functions are provided to block a thread until a specific
counter value is reached.

Event polling is supported using the history of previous event counts
to determine the predicted time of the next event.  A decay function
emphasizes more recent event history.

Polling delays are bounded by minimum and maximum values in the constructor
parameters.

wait_for() and wait_until() block the calling thread until the target
event count is reached (or the counter is reset).  These functions are
not bounded by min_delay or max_delay, and require a separate thread
to call update().  wait_for() waits for the counter to be incremented
from its current value by the given count.  wait_until() waits for the
counter to reach an absolute value.

update() counts external events and unblocks threads that are blocked
in wait_for() or wait_until().  If the event counter decreases then it
is reset to the new value.

duration() and time_point() convert relative and absolute event counts
into relative and absolute C++11 time quantities based on the last update
time, last observed event count, and the observed event rate.

Convenience functions seconds_for() and seconds_until() calculate
polling delays for the desired relative and absolute event counts
respectively.  These delays are bounded by max and min delay parameters.

rate() and ratio() provide conversion factors based on the current
estimated event rate.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
a3f02d5dec roots: comment updates and general cleanup
Fix discussion of nodatasum files, clarifying what we can and cannot do.

Get rid of some BEESNOTE and BEESTRACE calls which cannot be observed
(well, BEESNOTE can, but you have to be quick!).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
f6909dac17 bees: drop BEESINFO
Having too many "write a message to the log" primitives is confusing,
and having one that intermittently and silently discards output is even
_more_ confusing.

Replace all BEESINFO with appropriate BEESLOG*s.  Usually DEBUG.
Except for one or two that occur too often.  Just delete those.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
bd2a15733c README: update Linux kernel bugs list (v4.14)
Add the new WARN_ON bug in v4.14.

Clarify what happens when bees is run on a kernel that is too old.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:05 -05:00
Zygo Blaxell
4ecd467ca0 BeesBlockData: don't leak file contents in the log
The data field of BeesBlockData is only interesting to those who want
to debug the BeesBlockData implementation or other battle-tested parts
of bees.  Users who want to do this can modify and rebuild the source
to enable the output.

To everyone else, the data field is a huge, ongoing infoleak through
the log.

Don't bother with an option, just output the length of the data field
and nothing else.

Fixes:  https://github.com/Zygo/bees/issues/53

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:04 -05:00
Zygo Blaxell
71be53eff6 types: don't throw an exception when it's likely we are already reporting an exception
Empty files are a thing that can happen.  Don't bomb out just reporting
one's existence.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:04 -05:00
Zygo Blaxell
67ac537c5e time: drop unused Timer methods
Timer::set(double d) in particular seems...wrong.

Nothing uses them, so don't bother to fix them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:04 -05:00
Zygo Blaxell
f64fc78e36 Task: convert print_fn to a string
Since we are now unconditionally rendering the print_fn as a static
string, there is no need for it to be a function.  We also need it to
be brief and mostly constant.

Use a string instead.  Put the string before the function in the Task
constructor arguments so that the title string appears as a heading in
code, since we are making a breaking API change already.

Drop TASK_MACRO as it is broken by this change, but there is no similar
usage of Task anywhere to make it worth fixing.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:48:04 -05:00
Zygo Blaxell
0710208354 BeesNote: thread naming fixes
Move pthread_setname_np to the same place we do pthread_getname_np.

Detect errors in pthread_getname_np--but don't throw an exception
because we would call ourself recursively from the exception handler
when it tries to log the exception.

Fix the order of set_name and the first BEESNOTE/BEESLOG call in threads,
closing small time intervals where logs have the wrong thread name,
and that wrong name becomes persistent for the thread.

Make the main thread's name "bees" because Linux kernel stack traces use
the pthread name of the main thread instead of the name of the process.

Anonymous threads get the process name (usually "bees").  We should not
have any such threads, but we do.  This appears to occur mostly during
exception stack unwinding.  GCC/pthread bug?

Fixes:  https://github.com/Zygo/bees/issues/51

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-26 23:47:47 -05:00
Kai Krakow
c17618c371 README: Some things are simply no longer true
Environment variables are no longer the /only/ option.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-20 14:47:04 -05:00
Kai Krakow
dee6f189bb README: Fix markdown syntax error
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-20 14:47:04 -05:00
Kai Krakow
de6d7d6f25 Makefile: Get rid of test for-loop
Tests could now be run in parallel. Additionally, single tests can be
run by simply using "make testname", i.e. "make chatter" would run the
chatter test.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-20 14:44:27 -05:00
Kai Krakow
63f249f005 Makefile: force rebuilding tests when Makefile changed
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-20 14:43:37 -05:00
Kai Krakow
ca1a3bed12 Makefile: -lXXXXX is really a filename parameter
According to gcc docs, -l is converted to a filename which makes it a
filename parameter. Let's move it to the end.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-20 14:43:21 -05:00
Kai Krakow
d6312c338b Logging: Improve text layout when discarding log timestamps
When timestamps are removed from logging, the current text layout shows
lines like

tid 12345 thread_name: Example log

Let's convert it to a more conforming layout:

thread_name[12345]: Example log

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-20 14:42:49 -05:00
Zygo Blaxell
5533d09b3d Merge remote-tracking branch 'kakra/proposal/prepare-for-more-libs'
2018-01-20 14:23:55 -05:00
Zygo Blaxell
4c05c53d28 roots: update Task print functions for new usage
This restores the old "crawl" prefix in the case of Crawler log messages.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-20 14:00:52 -05:00
Zygo Blaxell
5063a635fc logging: get Task names for log messages
When a Task worker thread is executing a Task, the thread name is less
useful than the Task description.

Use the Task description instead of the thread name if the thread has
no BeesThread name and the thread is currently executing a task.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-20 14:00:51 -05:00
Zygo Blaxell
fef7aed8fa BeesNote: if thread name was not set, get it from Task or pthread_getname_np
Threads from the Task module in libcrucible don't set BeesNote::tl_name.
Even if they did, in Task context the thread name is unspecific to the point
of meaninglessness.

Use the Task::print method as the name for such threads, and be sure
that future Task print functions are designed for that usage.

The extra complexity in BeesNote::get_name() seems preferable to
bombarding pthread_setname_np hundreds or thousands of times per second.

FIXME:  we are now calling Task::print() on every BeesNote, which
is effectively unconditional.  Maybe we should have Task::print()
and get_name() return a closure, or just evaluate Task::print() once
and cache it in TaskState, or define Task's constructor with a string
argument instead of the current print_fn closure.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-20 13:57:51 -05:00
Zygo Blaxell
3f60a0efde task: allow external access to Task print function
This enables bees' thread introspection to use task descriptions in
status and log messages.

BeesNote will be calling Task::current_task() from non-Task contexts,
which means we need to allow Task's shared state pointer to be null.
Remove some asserts that will ruin our day in that case.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-20 13:51:05 -05:00
Zygo Blaxell
e970ac6c02 crawl: make logging less verbose
Silence the three(!) log messages per crawl increment and an extra one at
the end of the subvol.

The three critical messages per subvol crawl cycle are:

	Next transid in BeesCrawlState <SUBVOL>:0 offset 0x0 transid <A>..<B> started <T> (<AGO>s ago)

Subvol has been completely scanned and a new transaction range will
be created.  CrawlState is the state of the old subvol.

	Restarted crawl BeesCrawlState <SUBVOL>:0 offset 0x0 transid <B>..<C> started <T+AGO> (0s ago)

Subvol has been restarted.  CrawlState is the state of the new subvol.

	Deferring next transid in BeesCrawlState <SUBVOL>:0 offset 0x0 transid <B>..<C> started <T+AGO> (0s ago)

Subvol has been completely scanned, but it is too soon to start a
new scan.

Fix the "Restart..." message to use the correct verb tense and to use
the correct BeesCrawlState data.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-20 13:50:47 -05:00
Zygo Blaxell
38ccf5c921 counters: track pair growing time
When we find a matching block we attempt to extend ("grow") the matched
pair around the first matching block.  This function takes the IO hit of
reading the second extent from each duplicate extent pair.  It's also
very slow--too many allocations, too small reads, reads in the wrong
order, an order of magnitude too many calls to TREE_SEARCH_V2, and it
is usually in the top 3 most frequent PERFORMANCE warnings.

Start tracking the running time of grows using the pairforward_ms
and pairbackward_ms counters so that we can compare it to various
replacements.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-20 13:04:56 -05:00
Kai Krakow
826b27fde2 Makefile: Fix some dependencies
Some deps are already referenced by depends.mk, some were actually
missing.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-19 01:50:13 +01:00
Kai Krakow
8a5f790a03 Makefile: Some cleanups
Reorder and reformat some arguments so it looks more streamlined during
the build process.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-19 01:50:13 +01:00
Kai Krakow
677da5de45 Logging: Add log levels to output
This commit adds log levels to the output. In systemd, it makes colored
lines, otherwise it's probably just a number. Bees is very chatty, so
this paves the road for log level filtering.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 23:41:29 +01:00
Kai Krakow
d6b847db0d Makefile: speedup dependency generation
Dependencies can be generated in parallel which can be much faster. It
also avoids the problem that the for loop may fail multiple times in a
row, leaving behind a broken intermediate file which would be picked up
by successive runs.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Kai Krakow
b8f933d360 Makefile: do not be verbose about mv
A small left-over from me fixing the same problem as Zygo did in his
merged branch.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Kai Krakow
27b12821ee Makefile: Generalize the .version.cc target
This enables us to move the file around later.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Kai Krakow
fdf434e8eb Makefile: fix dependency generation
Let's generalize the depends.mk target so we can easily move files
around later. While doing it, let's also fix the "gcc -M" call to use
explicit target names and not clobber it with preprocessor output.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Kai Krakow
bc1b67fde1 Makefile: rename OBJS to CRUCIBLE_OBJS
This paves the way for building different .so libs.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Kai Krakow
4cfd5b43da Makefile: generalize .so target
We can generalize the .so target by moving its depends into rules
without build instructions.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Kai Krakow
4789445d7b Makefile: .o already depends on its .h file
We can remove the explicit depend on the .h file because that is covered
by depends.mk. Let's instead depend on makeflags which makes more sense.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Kai Krakow
c8787fecd2 Makefile: depends.mk is not an optional include
We really need depends.mk in the following Makefile reorganization.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-18 22:53:00 +01:00
Zygo Blaxell
4943a07cce crucible: cache: linked-list LRU implementation
We need a better cache expiration algorithm than "make a copy of
the entire thing, sort it while holding a lock, and delete half
the items in a single burst."

Replace the Lamport clock with a double-linked list.  Each insert
or lookup operation moves the affected item to the head of the list.
Each erase operation deletes one single item at the tail of the list.

Also sort out some iterator invalidation nonsense by doing erases before
inserts instead of "insert, erase, find the inserted item again because
we invalidated the found iterator during the erase."

The new implementation adds a second word-sized member to each Value
as well as a copy of the Key.  Hopefully the enlarged size is not
a deal-breaker.
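
A rough sketch of a linked-list LRU along these lines, assuming a std::list
for recency order and a std::map for lookup.  The class LruSketch and its
interface are hypothetical and are not the crucible cache implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <utility>

// Hypothetical sketch of a linked-list LRU; not the crucible cache class.
template <class Key, class Value>
class LruSketch {
	using List = std::list<std::pair<Key, Value>>;  // head = most recently used
	List m_list;                                    // holds a copy of each Key
	std::map<Key, typename List::iterator> m_map;
	std::size_t m_max;
public:
	explicit LruSketch(std::size_t max_size) : m_max(max_size) {
		assert(max_size > 0);
	}

	void insert(const Key &k, const Value &v) {
		auto found = m_map.find(k);
		if (found != m_map.end()) {
			// Existing key: update the value and move it to the head.
			found->second->second = v;
			m_list.splice(m_list.begin(), m_list, found->second);
			return;
		}
		// Erase before insert, so no iterator we still need is invalidated.
		if (m_map.size() >= m_max) {
			m_map.erase(m_list.back().first);
			m_list.pop_back();  // one single item at the tail
		}
		m_list.emplace_front(k, v);
		m_map[k] = m_list.begin();
	}

	// Returns a pointer to the cached value, or nullptr on a miss.
	Value *lookup(const Key &k) {
		auto found = m_map.find(k);
		if (found == m_map.end()) return nullptr;
		m_list.splice(m_list.begin(), m_list, found->second);
		return &found->second->second;
	}

	std::size_t size() const { return m_map.size(); }
};
```

Each operation touches a constant number of list nodes, so expiry costs one
erase per insert instead of a copy-and-sort of the whole cache.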

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:58:44 -05:00
Zygo Blaxell
00d9b8ed76 hash: do the mlock after loading the table
The mlock runs much faster, probably because the hash fetches are
doing most of the work that mlock does.

It makes bees startup latency for testing smaller, even if it takes more
time in absolute terms.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:58:44 -05:00
Zygo Blaxell
e8b4ab54c6 README: describe the scanning mode (-m option)
Include a brief description of the two algorithms without getting
into too much detail for an ostensibly temporary feature.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:58:44 -05:00
Zygo Blaxell
56c23c4517 crawl: implement two crawler algorithms and adjust scheduling parameters
There are two subvol scan algorithms implemented so far.  The two modes
are unimaginatively named 0 and 1.

	0:  sorts extents by (inode, subvol, offset),

	1:  scans extents round-robin from all subvols.

Algorithm 0 scans references to the same extent at close to the same
time, which is good for performance; however, whenever a snapshot is
created, the scan of the entire filesystem restarts at the beginning of
the new snapshot.

Algorithm 1 makes continuous forward progress even when new snapshots
are created, but it does not benefit from caching and will force the
kernel to reread data multiple times when there are snapshots.
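
The difference between the two orderings can be illustrated with a toy
model.  This is a simplification under stated assumptions: the struct
ExtentRefSketch and both functions are hypothetical, and the real crawler
works incrementally rather than on in-memory vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical extent reference; field names are for illustration only.
struct ExtentRefSketch {
	uint64_t subvol;
	uint64_t inode;
	uint64_t offset;
};

// Mode 0: sort by (inode, subvol, offset), so references to the same inode
// in different subvols end up next to each other.
std::vector<ExtentRefSketch> order_mode0(std::vector<ExtentRefSketch> refs) {
	std::sort(refs.begin(), refs.end(),
		[](const ExtentRefSketch &a, const ExtentRefSketch &b) {
			return std::tie(a.inode, a.subvol, a.offset)
			     < std::tie(b.inode, b.subvol, b.offset);
		});
	return refs;
}

// Mode 1: take one reference from each subvol's list in turn.
std::vector<ExtentRefSketch> order_mode1(
		const std::vector<std::vector<ExtentRefSketch>> &per_subvol) {
	std::vector<ExtentRefSketch> out;
	std::size_t longest = 0, total = 0;
	for (const auto &q : per_subvol) {
		longest = std::max(longest, q.size());
		total += q.size();
	}
	for (std::size_t i = 0; i < longest; ++i)
		for (const auto &q : per_subvol)
			if (i < q.size()) out.push_back(q[i]);
	assert(out.size() == total);  // every reference is emitted exactly once
	return out;
}
```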

The algorithm can be selected at run-time using the -m or --scan-mode
option.

We can collect some field data on these before replacing them with
an extent-tree-based scanner.  Alternatively, for pre-4.14 kernels,
we can keep these two modes as non-default options.

Currently these algorithms have terrible names.  TODO:  fix that, but
also TODO: delete all that code and do scans directly from the extent
tree instead.

Augment the scan algorithms relative to their earlier implementation by
batching multiple extents to scan from each subvol before switching to
a different subvol.

Sprinkle some BEESNOTEs on the Task objects so that they don't
disappear from the thread status output.

Adjust some timing constants to deal with the increased latency from
competing threads.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:53:49 -05:00
Zygo Blaxell
055c8d4c75 roots: scan in parallel using Tasks
Distribute incoming extents across a thread pool for faster execution
on multi-core, multi-disk environments.

Switch extent enumeration model to scan extent refs consecutively(ish).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:52:00 -05:00
Zygo Blaxell
090d79e13b crucible: remove unused TimeQueue and WorkQueue classes
WorkQueue is superseded by Task.  TimeQueue will be replaced by
something based on Tasks.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:52:00 -05:00
Zygo Blaxell
796aaed7f8 roots: remove dead code and #if blocks
In both instances the code contained within (or the conditional
compilation surrounding it) is no longer controversial.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:52:00 -05:00
Zygo Blaxell
8849e57bf0 crucible: add Task class
We need a mechanism for distributing work across processor cores and
disks.

Task implements a simple FIFO/LIFO queue model for executing closures.
Some locking primitives are included (mutex and barrier).

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:51:59 -05:00
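The queue model described in commit 8849e57bf0 above (closures executed in FIFO or LIFO order by a pool of threads) can be sketched as follows. This is a minimal illustration under stated assumptions, not the crucible Task interface: the names `ClosureQueue`, `append`, `prepend`, and `closure_queue_example` are hypothetical, and the mutex and barrier primitives mentioned in the commit are left out.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical closure queue; not the crucible Task interface.
class ClosureQueue {
	std::mutex m_mutex;
	std::condition_variable m_cond;
	std::deque<std::function<void()>> m_queue;
	bool m_done = false;
	std::vector<std::thread> m_workers;

	void worker_loop() {
		for (;;) {
			std::function<void()> job;
			{
				std::unique_lock<std::mutex> lock(m_mutex);
				m_cond.wait(lock, [this] { return m_done || !m_queue.empty(); });
				if (m_queue.empty()) return;	// shutting down and fully drained
				job = std::move(m_queue.front());
				m_queue.pop_front();
			}
			job();	// run outside the lock so other workers can proceed
		}
	}

public:
	explicit ClosureQueue(std::size_t thread_count) {
		assert(thread_count > 0);
		for (std::size_t i = 0; i < thread_count; ++i) {
			m_workers.emplace_back([this] { worker_loop(); });
		}
	}

	// FIFO: run after everything already queued.
	void append(std::function<void()> f) {
		{
			std::unique_lock<std::mutex> lock(m_mutex);
			m_queue.push_back(std::move(f));
		}
		m_cond.notify_one();
	}

	// LIFO: run before everything already queued.
	void prepend(std::function<void()> f) {
		{
			std::unique_lock<std::mutex> lock(m_mutex);
			m_queue.push_front(std::move(f));
		}
		m_cond.notify_one();
	}

	// Let the workers drain the queue, then join them.
	~ClosureQueue() {
		{
			std::unique_lock<std::mutex> lock(m_mutex);
			m_done = true;
		}
		m_cond.notify_all();
		for (auto &t : m_workers) t.join();
	}
};

// Usage: 100 closures add 1..100 into a shared counter on 4 threads.
inline int closure_queue_example() {
	std::atomic<int> sum(0);
	{
		ClosureQueue queue(4);
		for (int i = 1; i <= 100; ++i) {
			queue.append([&sum, i] { sum += i; });
		}
	}	// the destructor returns only after every closure has run
	return sum.load();
}
```

In this sketch a closure that throws would terminate the process; a real implementation would need to decide how to trap and report such exceptions.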
Zygo Blaxell
844a488157 README: update dependencies and Linux kernel bugs list
Bees will someday rely on features available only in kernel v4.14.

Let's start now by removing workarounds for bugs that were fixed in v4.11.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:51:59 -05:00
Zygo Blaxell
a175ee0689 bees: clean up #if 0 ... fsync ... #endif code
Remove some dead code because dedup-related deadlocks have not been
observed since Linux kernel v4.11.

Preserve rationale of remaining #if 0 block (why we do write/rename
instead of write/fsync/rename) so that people don't try to replace the
"missing" fsync() there.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:07 -05:00
Zygo Blaxell
f376b8e90d test: add -lpthread to Makefile
This resolves missing symbol build errors.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:07 -05:00
Zygo Blaxell
3da755713a Makefiles: don't append to depends.mk.new
Fixes errors such as:

	depends.mk:765: *** multiple target patterns.  Stop.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:07 -05:00
Zygo Blaxell
8d3a27bf85 subvol-threads: increase resource and thread limits
With kernel 4.14 there is no sign of the previous LOGICAL_INO performance
problems, so there seems to be no need to throttle threads using this
ioctl.

Increase the FD cache size limits and scan thread count.  Let the kernel
figure out scheduling.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:07 -05:00
Zygo Blaxell
42a6053229 roots: remove open_root_cache correctly
BEESNOTE puts a message on the status message stack.  BEESINFO logs a
message with rate limiting.  The message that was flooding the logs
was coming from BEESINFO not BEESNOTE.

Fix earlier commit which removed the wrong message.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:07 -05:00
Zygo Blaxell
c477618924 crucible: resource: optimize map cleanup
We were holding weak refs until the next time the resource ID was used.
This is a bad thing if resource IDs are sparse (e.g. pointers or hashes)
because we'll never see an ID twice.

To fix, determine whether we released the last instance of a resource,
and if so, free its weak ref immediately.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:07 -05:00
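The idea in commit c477618924 above (drop the weak reference as soon as the last instance of a resource is released, rather than waiting for the ID to be seen again) might look roughly like this. It is a sketch under assumptions: `WeakRegistry`, `get`, and `size` are hypothetical names, and the actual Resource class may be organized quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical registry mapping sparse IDs to shared instances.
// The registry must outlive every handle it gives out.
template <class Key, class T>
class WeakRegistry {
	// Recursive so the deleter can still lock if it is ever invoked
	// while get() holds the lock (e.g. shared_ptr construction fails).
	std::recursive_mutex m_mutex;
	std::map<Key, std::weak_ptr<T>> m_map;

public:
	std::shared_ptr<T> get(const Key &key) {
		std::unique_lock<std::recursive_mutex> lock(m_mutex);
		auto it = m_map.find(key);
		if (it != m_map.end()) {
			if (auto existing = it->second.lock()) return existing;
		}
		// The deleter runs when the last shared_ptr goes away, which is
		// the moment we know the weak ref is no longer useful.
		std::shared_ptr<T> created(new T(), [this, key](T *p) {
			delete p;
			std::unique_lock<std::recursive_mutex> lock(m_mutex);
			auto found = m_map.find(key);
			// Another thread may have re-created the entry already;
			// only erase it if it is still the expired one.
			if (found != m_map.end() && found->second.expired()) {
				m_map.erase(found);
			}
		});
		m_map[key] = created;
		return created;
	}

	std::size_t size() {
		std::unique_lock<std::recursive_mutex> lock(m_mutex);
		return m_map.size();
	}
};

// Usage: the map entry disappears with the last handle.
inline std::size_t weak_registry_example() {
	WeakRegistry<int, int> registry;
	{
		auto a = registry.get(1234567);
		auto b = registry.get(1234567);
		assert(a == b);			// same instance for the same ID
		assert(registry.size() == 1);
	}
	return registry.size();
}
```

Without the erase in the deleter, a map keyed by pointers or hashes would keep one dead entry per ID ever seen, which is the growth the commit message describes.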
Zygo Blaxell
35100c2b9e crucible: resource: remove excess locking
The bugs in other parts of the code have been identified and fixed,
so the overprotective locks around shared_ptr can be removed.

Keep the other improvements to the Resource class.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:06 -05:00
Zygo Blaxell
116f15ace5 lockset: drop unused method wait_unlock
This function is not used and does not appear to be useful.

Remove it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-17 22:30:06 -05:00
Zygo Blaxell
8a68b5f20b crucible: add cleanup class
Store a function (or closure) in an instance and invoke the function
from the destructor.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-15 11:07:48 -05:00
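The pattern described in commit 8a68b5f20b above (store a callable, invoke it from the destructor) is commonly known as a scope guard. A minimal sketch, assuming a hypothetical class name `ScopeCleanup` rather than the name used in crucible:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scope guard: runs the stored callable when destroyed.
class ScopeCleanup {
	std::function<void()> m_cleaner;

public:
	explicit ScopeCleanup(std::function<void()> f) : m_cleaner(std::move(f)) {}

	// Non-copyable so the callable runs exactly once.
	ScopeCleanup(const ScopeCleanup &) = delete;
	ScopeCleanup &operator=(const ScopeCleanup &) = delete;

	~ScopeCleanup() {
		if (m_cleaner) m_cleaner();
	}
};

// Usage: the counter is bumped only when the guard leaves scope.
inline int cleanup_example() {
	int calls = 0;
	{
		ScopeCleanup guard([&calls] { ++calls; });
		assert(calls == 0);	// nothing has run yet
	}
	return calls;
}
```

Because the callable runs inside a destructor, it should not throw; an exception escaping a destructor normally ends in `std::terminate()`.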
Kai Krakow
6d6aedd8ec Makefile: Fail gracefully if markdown is not installed
Previously, MARKDOWN could end up empty. This commit should fix it.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 21:25:12 +01:00
Kai Krakow
025b14f38f Installation: Depend Gentoo ebuild on markdown
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 20:56:48 +01:00
Kai Krakow
dd6d8caaa2 Installation: Remove superfluous cruft from Gentoo ebuild
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 20:56:34 +01:00
Zygo Blaxell
4bfb637b0e Merge remote-tracking branch 'nefelim4ag/master' 2018-01-10 23:43:00 -05:00
Zygo Blaxell
4aa5978a89 hash: reduce mutex contention using one mutex per hash table extent
This avoids PERFORMANCE warnings when large hash tables are used on slow
CPUs or with lots of worker threads.  It also simplifies the code (no
locksets, only one object-wide mutex instead of two).

Fixed a few minor bugs along the way (e.g. we were not setting the dirty
flag on the right hash table extent when we detected hash table errors).

Simplified error handling:  IO errors on the hash table are ignored,
instead of throwing an exception into the function that tried to use the
hash table.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-10 23:25:45 -05:00
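The locking scheme named in commit 4aa5978a89 above (one mutex per hash table extent, so threads touching different extents do not contend) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class `ExtentLockedTable`, its cell layout, and the way a hash is mapped to an extent are hypothetical and far simpler than the real bees hash table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical table split into extents, each guarded by its own mutex.
class ExtentLockedTable {
	struct Extent {
		std::mutex mutex;
		std::vector<std::uint64_t> cells;
		bool dirty = false;	// set when the extent needs writing back
	};
	std::vector<Extent> m_extents;
	std::size_t m_cells_per_extent;

	Extent &extent_for(std::uint64_t hash) {
		return m_extents[hash % m_extents.size()];
	}
	std::size_t cell_for(std::uint64_t hash) const {
		return (hash / m_extents.size()) % m_cells_per_extent;
	}

public:
	ExtentLockedTable(std::size_t extent_count, std::size_t cells_per_extent)
		: m_extents(extent_count), m_cells_per_extent(cells_per_extent) {
		assert(extent_count > 0 && cells_per_extent > 0);
		for (auto &e : m_extents) e.cells.assign(cells_per_extent, 0);
	}

	void store(std::uint64_t hash, std::uint64_t value) {
		Extent &e = extent_for(hash);
		std::unique_lock<std::mutex> lock(e.mutex);	// locks one extent only
		e.cells[cell_for(hash)] = value;
		e.dirty = true;	// mark the extent that was actually modified
	}

	std::uint64_t load(std::uint64_t hash) {
		Extent &e = extent_for(hash);
		std::unique_lock<std::mutex> lock(e.mutex);
		return e.cells[cell_for(hash)];
	}

	bool is_dirty(std::size_t extent_index) {
		Extent &e = m_extents.at(extent_index);
		std::unique_lock<std::mutex> lock(e.mutex);
		return e.dirty;
	}
};
```

With 4 extents, a store for hash 5 locks and dirties extent 1 only; threads working on extents 0, 2, and 3 are not blocked.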
Kai Krakow
365a913a26 Installation: Add Gentoo ebuild
This commit adds an ebuild for Gentoo. Version 9999 is building live
from current git, currently using kakra:integration because it has some
installation and build fixes important for Gentoo.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 03:03:17 +01:00
Kai Krakow
634a1d0bf6 Installation: -fPIC should not be used unconditionally
According to Gentoo packaging guide, -fPIC should only be used on shared
libraries, and not added unconditionally to every linker call.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 02:30:12 +01:00
Kai Krakow
3a24cd3010 Installation: Fix soname QA warning in Gentoo
Gentoo warns about libs missing a proper soname during QA phase. Let's
fix this.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 02:30:12 +01:00
Kai Krakow
3391593cb9 Installation: Keep version tag in a variable
To prepare soname handling, we need to keep the version tag in a
variable.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 02:02:36 +01:00
Kai Krakow
fdd8350239 Installation: Improve filesystem layout flexibility
In preparation for Gentoo QA checks during ebuild merge phase, let's
make some more of the filesystem layout adjustable.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-11 02:02:36 +01:00
Kai Krakow
60cd9c6165 Installation: Introduce DESTDIR into Makefile
In Gentoo, usage of DESTDIR is automatically handled by the build system
to support installation into a clean image from which the package is
created.

Thus, let's add DESTDIR to the install targets. One can now correctly
install bees with packaging systems simply by running:

$ DESTDIR=/tmp/bees-image make all install

This no longer interferes with the PREFIX setting.

CC: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-10 22:35:22 +01:00
Kai Krakow
f0e02478ef Installation: Document optional dependency on blkid
If using `scripts/beesd`, we need `blkid` which is part of util-linux.
It should be available on every distribution but let's document it
anyway.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-10 21:47:01 +01:00
Kai Krakow
421641e242 Makefile: Document scripts/beesd
Add a paragraph about the helper script `scripts/beesd` to automatically
setup and configure bees.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-10 21:06:49 +01:00
Kai Krakow
0fce10991b Installation: Add Arch Linux instructions
Closes #34

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-10 20:43:54 +01:00
Kai Krakow
a465d997bd Makefile: Document Makefile changes 2018-01-10 20:41:56 +01:00
Kai Krakow
361ef0bebf Installation: Add new section to README 2018-01-10 20:41:37 +01:00
Kai Krakow
1fcf07cc2a Installation: Prepare README
Rename a section in preparation for a new install section.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-10 20:41:00 +01:00
Kai Krakow
333b2f7746 Makefile improvement
Now you can make bees fly as pointed out in the README... ;-)
2018-01-10 20:09:38 +01:00
Timofey Titovets
ff9e0e3571 Fix: exec bees - breaks bash trap handling of umount bees workdir
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2018-01-09 23:25:57 +01:00
Timofey Titovets
2d49d98bd2 Fix: exec bees - breaks bash trap handling of umount bees workdir
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2018-01-09 22:33:57 +03:00
Kai Krakow
92aa13a6ae Add beesd@.service to gitignore
It's a generated file. We should ignore it so it won't accidentally be
checked in.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 01:56:22 +01:00
Kai Krakow
f0c516f33b Makefile: let "make install" install the complete distribution
More than once I ran only "make install", which doesn't install the
scripts.

Let's fix this by renaming the previous install target to install_bees,
and then make a new install target which depends on each install target
and thus installs the complete distribution.

It doesn't hurt to install those few scripts. I don't see the point in
separating the install targets as it was previously done.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 01:54:41 +01:00
Kai Krakow
8e2139d6ed Makefile: depend install_scripts on scripts
For consistency with the other install target, let's depend
install_scripts on its build targets.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 01:51:18 +01:00
Kai Krakow
b959af1a15 systemd: Provide URL and better description
Let's direct users to the support site when they ask systemd for help
about the service unit, or by looking at error messages.

Also, let's adjust the description to be more pleasing to the eyes. The
previous long description with uncommon formatting really stuck out in
the boot logs.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 01:32:17 +01:00
Kai Krakow
78f96a9fbd systemd: Don't start without essential system services
Starting bees right after local-fs.target is probably not what we want,
as basic setup of the system might not have been done (like udev,
cryptsetup, sysctl, swap, etc).

Let's start only after sysinit.target instead, which guarantees that all
basic setup has been done. Most importantly, sysctl, udev, and swap have
been set up, which may apply important tweaks, configuration, and tuning.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 01:29:05 +01:00
Kai Krakow
953c158868 systemd: Don't start in system-update.target
Due to bees installing into the local-fs.target, bees also runs during
system-update.target. This should not be done, system-update.target is
meant as an isolated bootup mode for applying updates offline, that is:
Only essential services are running.

Fix this by making it WantedBy basic.target instead. According to
system-update.target and "man bootup", system-update.target pulls in
sysinit.target, as does basic.target. So essentially, basic.target is
not part of the system-update.target transaction.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 01:24:35 +01:00
Kai Krakow
f7f99f52b5 Generalize sed invocation rule
Remove the redundant sed call by generalizing the rule to apply sed to
.in templates.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 00:48:14 +01:00
Kai Krakow
abeb6e74b2 Add scripts to "make all" target
This prevents scripts being generated by "root" during "sudo make
install" phase.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 00:48:14 +01:00
Kai Krakow
6c67ae0d5e Don't zap localconf in "make clean"
When you run "make clean", localconf is being removed. This is probably
in most cases not intentional.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2018-01-09 00:48:14 +01:00
Zygo Blaxell
ba981c133a Merge remote-tracking branches 'kakra/feature/add-relative-path-option' and 'kakra/integration' 2018-01-07 21:39:01 -05:00
Zygo Blaxell
9d295fab4e Makefile: if multiple Markdown utilities are present, use the first one
If two utilities are found, we get commands like

	/usr/bin/markdown /usr/bin/markdown_py README.md > README.html

and that doesn't work.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-07 20:23:20 -05:00
Zygo Blaxell
dc7360397e README: update the state of bees and the kernel for v4.14
Read-only snapshots have always just worked.  Remove them from the
"untested" list.

nodatasum (and therefore nodatacow) inodes are simply ignored.  This seems
like the right thing to do since deduping a nodatacow extent turns it
into a datacow extent, which seems contrary to administrator wishes
implied by the nodatacow bit.  We probably need an option to
override that assumption.

Clarify why converted ext[234] filesystems may cause problems and
the nature of those problems.

Assorted minor editorial changes.

Discuss calculation of the balance limit parameter when ensuring
sufficient metadata space.

Update kernel version bug/fix/feature lists, including LOGICAL_INO_V2.

Annotate kernel workaround list with known kernel versions that make
the workarounds necessary.

Remove reference to 'DEFRAG_RANGE' as bees requires much more control
over data placement than this interface can offer.  It's easy enough to
create a new ioctl to implement bees requirements once it's known what
those requirements are.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2018-01-07 20:23:20 -05:00
Zygo Blaxell
305ab5dbfa Merge remote-tracking branch 'nefelim4ag/master' 2018-01-06 22:54:49 -05:00
Timofey Titovets
80e4302958 Update btrfs compression types, add ZSTD, drop LAST
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2018-01-04 20:32:04 +03:00
Zygo Blaxell
07751885d2 error: drop redundant CHECK_CONSTRAINT
CHECK_CONSTRAINT is just THROW_CHECK1 with an inconsistent name.
Remove it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-12-21 14:00:17 -05:00
Zygo Blaxell
77614a0e99 scan: insert toxic matched extents into hash table as they are discovered
When a toxic extent is discovered, insert the offending hash/address/toxic
entry into the hash table.

When a previously discovered toxic extent is encountered, do nothing,
i.e. allow the offending hash/address/toxic entry in the hash table
to expire.

Previously both inserts were removed from the code, but the former one
is required.  The latter prevents bees from forgiving toxic extents
(or any hash matching one) should they be relocated, deleted, or simply
become non-toxic.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-12-21 13:56:15 -05:00
Zygo Blaxell
649ae5bb40 makeflags: fix missing -D_FILE_OFFSET_BITS=64 in comment
Interesting things happen when blindly swapping the release-build CCFLAGS
with the debug-build commented-out CCFLAGS.  None of these things that
happen are good.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-12-21 13:28:49 -05:00
Timofey Titovets
40112faf0f Makefile add scripts target for correctly packaging
The current scheme leads to paths like the following while packaging:
/tmp/makepkg/bees-git/pkg/bees-git/usr/lib/bees/bees

To avoid that, allow the steps to be run separately:
make
make scripts
make install ...
make install_scripts ...

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2017-12-21 13:26:45 +03:00
Kai Krakow
bfb768a079 Fix a fallthrough error in GCC 7+
GCC 7 and higher turn a previous warning into an error for implicit
fallthrough. Let's hint the compiler that this is intentional here.

Signed-off-by: Kai Krakow <kai@kaishome.de>
(cherry picked from commit 270a91cf17)
2017-11-14 11:26:47 -05:00
Kai Krakow
3024e43355 Fix a fallthrough error in GCC 7+
GCC 7 and higher turn a previous warning into an error for implicit
fallthrough. Let's hint the compiler that this is intentional here.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-11-14 07:00:28 +01:00
Kai Krakow
270a91cf17 Fix a fallthrough error in GCC 7+
GCC 7 and higher turn a previous warning into an error for implicit
fallthrough. Let's hint the compiler that this is intentional here.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-11-14 06:58:43 +01:00
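The fallthrough fix in commit 270a91cf17 above concerns GCC's `-Wimplicit-fallthrough` diagnostic, which becomes an error under `-Werror`. The commit message does not say which hint was used, so the following is only a sketch of one standard way to mark an intentional fallthrough: the C++17 `[[fallthrough]]` attribute. The function `stages_run_from` is a hypothetical example, not bees code.

```cpp
#include <cassert>

// Hypothetical example: starting at a given stage also runs all later stages.
inline int stages_run_from(int first_stage) {
	int ran = 0;
	switch (first_stage) {
		case 0:
			++ran;			// stage 0 work
			[[fallthrough]];	// intentional: continue into stage 1
		case 1:
			++ran;			// stage 1 work
			[[fallthrough]];	// intentional: continue into stage 2
		case 2:
			++ran;			// stage 2 work
			break;
		default:
			break;			// unknown stage: nothing runs
	}
	return ran;
}
```

Compilers that predate C++17 may need a different spelling, such as GCC's `__attribute__((fallthrough))` or a `// fallthrough` comment that GCC recognizes.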
Kai Krakow
93ba0f48de Make clear that options must be supplied in one variable
Previously, expectations may fail when just uncommenting both lines.
2017-11-14 06:58:43 +01:00
Kai Krakow
d930136484 Remove process forking from frontend script
Now with the patches integrated to filter logging output, we can finally
remove forking a subprocess and stop redirecting file descriptors.

We instead use exec to replace the process with the final daemon.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-11-14 06:58:43 +01:00
Kai Krakow
f7320baa56 Fix indentation/alignment after integration 2017-11-14 06:58:43 +01:00
Kai Krakow
21212cd3e3 Fix example config for timestamp logging 2017-11-14 06:58:43 +01:00
Kai Krakow
0c6a4d00c8 Remove filter path logic from frontend script
With relative path filtering in place, we can now give up spawning
subshells in the frontend script.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-11-14 01:16:06 +01:00
Kai Krakow
52997936d5 getopt: Add logic to set relative path from $CWD
This commit adds a new option to set relative path output for name_fd().

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-11-14 01:16:06 +01:00
Kai Krakow
755f16a948 crucible: Allow setting a relative path option for name_fd()
This commit adds an option to store a relative path in preparation for
more human-friendly log output.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-11-14 01:16:06 +01:00
Zygo Blaxell
71514e7229 main: use static function to control timestamps in log output
Adjust bees to match changes in Chatter's interface.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 66fd28830d)
2017-11-11 15:18:46 -05:00
Zygo Blaxell
78d04b1417 chatter: use static function to control timestamping behavior
Use a static function instead of embedding side-effects in the constructor
of an unrelated class.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 85106bd9a9)
2017-11-11 15:18:46 -05:00
Kai Krakow
47805253e6 Make service starter accept bees options
The service starter wasn't able to pass options to the new getopt
parser. This commit fixes it.
2017-10-28 00:14:36 +02:00
Kai Krakow
629e33b4f3 Fix naming 2017-10-28 00:13:38 +02:00
Kai Krakow
58157d03dd Add beesd generated script to gitignore 2017-10-28 00:13:23 +02:00
Kai Krakow
9d67329ef7 Update README after integrating new features
Let's update the README file.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-10-27 23:10:14 +02:00
Kai Krakow
c6be07e158 Add option for prefixing timestamps
To make bees more friendly to use with syslog/systemd, we add an option
to omit timestamps from the log output.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-10-27 23:02:47 +02:00
Kai Krakow
c6bf6bfe1d Implement getopt options parser
This commit adds a simple getopt options parser to show help. This can
be used as a boilerplate for adding more options later.

Signed-off-by: Kai Krakow <kai@kaishome.de>
2017-10-27 22:36:00 +02:00
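A boilerplate parser of the kind described in commit c6bf6bfe1d above can be built on `getopt_long()` from `<getopt.h>`. The sketch below is an assumption about the general shape only: the struct `ParsedOptions`, the function `parse_options`, and the `-h`/`--help` spelling are hypothetical and are not taken from the bees source.

```cpp
#include <cassert>
#include <getopt.h>

// Hypothetical result type for the sketch.
struct ParsedOptions {
	bool show_help = false;
	bool bad_option = false;
};

inline ParsedOptions parse_options(int argc, char **argv) {
	// Table of long options; new options are added as extra rows.
	static const struct option long_options[] = {
		{ "help", no_argument, nullptr, 'h' },
		{ nullptr, 0, nullptr, 0 },	// terminator required by getopt_long
	};

	ParsedOptions result;
	optind = 1;	// restart scanning so the function can be called again
	int c;
	while ((c = getopt_long(argc, argv, "h", long_options, nullptr)) != -1) {
		switch (c) {
			case 'h':
				result.show_help = true;
				break;
			default:
				// getopt_long returns '?' for unrecognized options
				result.bad_option = true;
				break;
		}
	}
	return result;
}
```

Adding an option later means adding one row to `long_options`, one character to the short option string, and one `case` to the switch.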
Kai Krakow
29d2d51c47 Fix libexec prefix discrepancy
Whoops...
2017-09-20 21:50:58 +02:00
Kai Krakow
893595190f Allow custom libexec location
To install for different distributions, LIBEXEC_PREFIX can now be set.
It defaults to $(PREFIX)/usr/lib/bees as used in most common
distributions.

Local overrides are possible by setting variables in a "localconf" file
which will be included by the Makefile if it exists.

For some distributions you may want to set it to /usr/libexec or
/usr/libexec/bees.
2017-09-20 21:00:54 +02:00
Kai Krakow
0455827989 Adjust CPU and IO shares when running under systemd
Let's remove the CPUQuota example and instead give bees a share of
what's available.

128 CPU shares will give it about 12% max CPU under load. Give it a
slight boost during startup to allow reading the hash table faster.

100 block shares will give it about 10% max disk bandwidth under load.
Give it a slight boost during startup to allow reading the hash table
faster.

Then let's adjust the CPU and IO scheduler to prefer other processes.
This way bees runs completely in the background, barely noticeable
during, e.g., gaming.
2017-09-19 21:07:37 +02:00
Kai Krakow
62626aef7f Adjust service restart and shutdown behavior
Explicitly set control-group kill mode, that is: try SIGTERM first, and
use SIGKILL after a timeout. This exactly defines how bees is running as
a child process within the frontend service starter. Not sure if bees cares
about signals but SIGTERM first seems cleaner. On the way, let bees restart
on abnormal termination.
2017-09-19 21:07:37 +02:00
Kai Krakow
f59e311809 Explicitly mark systemd unit as Type=simple
Bees does not fork, so let's not rely on systemd defaults.
2017-09-19 21:07:37 +02:00
Kai Krakow
3bf4e69c4d Make config example more clear
A pre-defined UUID should not be part of the sample config file.

Instead, make it more clear how the config file is intended to be used.
2017-09-19 21:06:18 +02:00
Kai Krakow
5622ebd411 Bees is meant to be run as root only
As bees is meant to be run as root only, move it to /usr/sbin, which is
usually not part of a normal user's PATH.
2017-09-19 20:32:09 +02:00
Kai Krakow
04cb25bd04 Move bees to libexec install dir
When bees is meant to be run mainly through the service frontend script,
we should move the bees binary to the libexec directory.
2017-09-19 20:30:51 +02:00
Zygo Blaxell
06b8fd8697 Merge remote-tracking branch 'kakra/master' 2017-09-18 22:34:20 -04:00
Zygo Blaxell
94ab477b90 Merge remote-tracking branch 'kakra/feature/markdown-detection' 2017-09-18 22:32:41 -04:00
Kai Krakow
cceb0480a5 Change README.md reflecting nodatacow inode attribute
The previous patch changed behavior regarding nodatacow inode attribute.
Let's document the new behavior.
2017-09-18 01:45:20 +02:00
Kai Krakow
23749eb634 Enable detection of markdown binary
Some distributions do not provide markdown as "markdown". Let's figure
out which version to use during build.
2017-09-18 01:30:59 +02:00
Zygo Blaxell
5afbcb99e3 roots: drop open_root_nocache log entry
After a few hundred subvol threads start running, the inode cache starts
to thrash, and the log gets spammed with messages of the form:

	"open_root_nocache <subvolid>: <path>"

Ideally there would be some way to schedule work to minimize inode
thrashing.  Until that gets done, just silence the messages for now.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 21:16:40 -04:00
Zygo Blaxell
5275249396 roots: trace transid_max calculation
transid_max calculations can take considerable time.  Report their
progress in more detail.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 17:30:45 -04:00
Zygo Blaxell
a07728bc7e tmpfiles: note that kernel race condition is not yet fixed
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 17:30:36 -04:00
Zygo Blaxell
732896b471 log: simplify output for dedup and scan
With many threads it is inconvenient to reassemble the elided parts of
the dedup src/dst and scan filenames output.  Simply output them
unconditionally, and balance the line lengths.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 17:30:30 -04:00
Zygo Blaxell
5cc5a44661 bees: drop unused BeesWorkQueue classes
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 17:30:22 -04:00
Zygo Blaxell
f6a6992ac9 README: update list of currently known kernel bugs
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 17:28:50 -04:00
Zygo Blaxell
ceda8ee6c3 Makefile: add test to PHONY list
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 17:27:36 -04:00
Zygo Blaxell
18ae15658e README: remove stray whitespace
No content changes.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 14:57:24 -04:00
Zygo Blaxell
339579096f roots: move flags check after file identity checks and make error message style consistent
If we lose a race and open the wrong file, we will not retry with the
next path if the file we opened had incompatible flags.  We need to keep
trying paths until we open the correct file or run out of paths.
Fix by moving the inode flag check after the checks for file identity.

Output attributes in hex to be consistent with other attribute error
messages.

There is no need to report root and file paths separately in the error
message for incompatible flags because we have confirmed the identity of
the file before the incompatible flag error is detected.  Other messages
in this loop still output root path and file_path separately because
the identity of 'rv' is unknown at the time these messages are emitted.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 14:49:09 -04:00
Zygo Blaxell
702a8eec8c bees: use ioctl_iflags_get and ioctl_iflags_set instead of opencoded versions
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 14:31:43 -04:00
Zygo Blaxell
5f18fcda52 crucible: add ioctl_iflags_set to complement ioctl_iflags_get
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-09-16 14:31:27 -04:00
Zygo Blaxell
088cbd24ff Merge branch 'master' of git://github.com/kakra/bees 2017-09-16 13:55:58 -04:00
Coenraad Loubser
8c9a44998d Verbatim Ubuntu build instructions
And link to work done so far on 14.04... (Doesn't work yet)
2017-09-16 09:40:52 +02:00
Kai Krakow
a5e2bdff47 Skip nocow files to speed up processing
If you have a lot of or a few big nocow files (like vm images) which
contain a lot of potential deduplication candidates, bees becomes
incredibly slow running through a lot of "invalid operation" exceptions.

Let's just skip over such files to get more bang for the buck. I did no
regression testing as this patch seems trivial (and I cannot imagine any
pitfalls either). The process progresses much faster for me now.
2017-09-12 02:09:22 +02:00
Zygo Blaxell
703bb7c1a3 bees: use handle type for hash table extent locks
Fixes build breakage after "crucible: lockset: track lockers and use
handle type".

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-06-17 10:22:06 -04:00
Zygo Blaxell
4f66d1cb44 crucible: lockset: track lockers and use handle type [bees master branch edition]
Keep track of the locking thread so we can see why we are deadlocked
in gdb.

Use a handle type for locks based on shared_ptr.  Change the handle type
name to flush out any non-auto local variables.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit aa0b22d445)
2017-06-17 10:21:33 -04:00
Zygo Blaxell
3901962379 bees: trace calls to BeesResolver
This helps identify causes of the "same physical address in dedup"
exception.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit cc7b4f22b5)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
48aac8a99a bees: drop unused constants
BLOCK_SIZE_MIN_EXTENT_DEFRAG, BLOCK_SIZE_MIN_EXTENT_SPLIT, and others
are no longer used.  Remove them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit a3d7032eda)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
b0ba4c4f38 bees: time tmpfile create and copy operations
Add time spent in file create and copy operations to the stats.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit f01c20f972)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
74d256f0fe bees: handle trace functions that throw exceptions
A BEESTRACE closure could throw an exception.  Trap those so we don't
end up in terminate().

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 59660cfc00)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
8cde833863 bees: make a thread note when we read data
Reads can block indefinitely due to bugs, low io priority, or poor
storage performance.  Record the block origin data in the thread state
so we can see which reads are problematic.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit f56f736d28)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
e0951ed4ba bees: use C++11 syntax for constant initializers
This lets us use more default constructors.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 8a932a632f)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
c479b361cd bees: remove file open serialization mutex
It is no longer necessary.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 5c91045557)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
c6c3990d19 bees: types: improve serialization of byte ranges
Use () instead of [] when the respective end of the byte range touches
the beginning or end of the file.  Also omit the '0' at beginning of
file.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 3023b7f57a)
2017-06-17 10:15:11 -04:00
Zygo Blaxell
3fdc217b4f bees: change formatting for physical bytenr ranges in dedup
Use a different character to make it easier to search for bytenr ranges
in the logs.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit d43199e3d6)
2017-06-17 10:15:08 -04:00
Zygo Blaxell
6c8d2bf428 bees: limit FD cache size explicitly
This will allow the default size limit for cache objects to be changed
with impunity.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 9daa51edaa)
2017-06-17 10:15:08 -04:00
Zygo Blaxell
d6f97edf4a crucible: fs: keep ioctl buffer between runs
perf blames the SEARCH_V2 ioctl wrapper for a lot of time spent in malloc.
Use a thread_local buffer for ioctl results, and reuse it between runs.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit e509210428)
2017-06-17 10:15:08 -04:00
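The optimization in commit d6f97edf4a above (keep one result buffer per thread and reuse it between ioctl calls instead of allocating each time) can be sketched without any actual ioctl. The helper name `scratch_buffer` is hypothetical, and a real wrapper would pass the buffer to the kernel rather than just return it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns this thread's scratch buffer, grown on demand.
// The storage is allocated once per thread and reused by later calls.
inline std::vector<char> &scratch_buffer(std::size_t min_size) {
	thread_local std::vector<char> buffer;
	if (buffer.size() < min_size) {
		buffer.resize(min_size);	// only reallocates when a larger size is needed
	}
	return buffer;
}

// Usage: a second, smaller request reuses the same storage.
inline bool scratch_buffer_example() {
	char *first = scratch_buffer(4096).data();
	char *second = scratch_buffer(1024).data();
	return first == second;
}
```

Since each thread owns its buffer, no locking is needed, but the contents are only valid until the same thread calls the helper again.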
Zygo Blaxell
312254a47b crucible: cache: no need to use explicit lock type
C++11 'auto' keyword is sufficient.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
(cherry picked from commit 44fedfc928)
2017-06-17 10:14:25 -04:00
Timofey Titovets
5350b0f113 Bees: fix [-Werror=implicit-fallthrough=]
GCC 7+ added the implicit-fallthrough warning.
In some places fallthrough is expected, so disable the warning there.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2017-06-13 18:05:38 +03:00
Zygo Blaxell
5a3f1be09e Merge git://github.com/Nefelim4ag/bees 2017-02-09 20:01:29 -05:00
Timofey Titovets
4b592ec2a3 Check: if disk with UUID are btrfs by blkid
The old check can't find a btrfs filesystem if it is not mounted.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2017-02-09 11:56:31 +03:00
Zygo Blaxell
dc00dce842 context: purge FD cache every COMMIT_INTERVAL
Holding file FDs open for long periods of time delays inode destruction.
For very large files this can lead to excessive delays while bees dedups
data that will cease to be reachable.

Use the same workaround for file FDs (in the root_ino cache) that
is used for subvols (in the root cache):  forcibly close all cached
FDs at regular intervals.  The FD cache will reacquire FDs from files
that still have existing paths, and will abandon FDs from files that
no longer have existing paths.  The non-existing-path case is not new
(bees has always been able to discover deleted inodes) so it is already
handled by existing code.

Fixes: https://github.com/Zygo/bees/issues/18

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-02-08 22:01:00 -05:00
Timofey Titovets
82b3ba76fa Makefile: make service install compatible with debian systems
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2017-01-30 05:29:28 +03:00
Zygo Blaxell
4113a171be crucible: cache: clean up use of iterators
check_overflow() will invalidate iterators if it decides there are too
many cache entries.

If items are deleted from the cache, search for the inserted item again
to ensure the iterator is valid.

Increase size of timestamp to size_t.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-23 21:12:34 -05:00
Zygo Blaxell
5713fcd770 bees: clean up statistics class
Some whitespace fixes.  Remove some duplicate code.  Don't lock
two BeesStats objects in the - operator method.

Get the locking for T& at(const K&) right to avoid locking a mutex
recursively.  Make the non-const version of the function private.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-22 22:00:28 -05:00
Zygo Blaxell
db8ea92133 bees: fix further instances of copy-after-unlock bug
Before:

        unique_lock<mutex> lock(some_mutex);
        // run lock.~unique_lock() because return
        // return reference to unprotected heap
        return foo[bar];

After:

        unique_lock<mutex> lock(some_mutex);
        // make copy of object on heap protected by mutex lock
        auto tmp_copy = foo[bar];
        // run lock.~unique_lock() because return
        // pass locally allocated object to copy constructor
        return tmp_copy;

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-22 22:00:27 -05:00
Zygo Blaxell
6099bf0b01 crucible: fix further instances of copy-after-unlock bug
Before:

	unique_lock<mutex> lock(some_mutex);
	// run lock.~unique_lock() because return
	// return reference to unprotected heap
	return foo[bar];

After:

	unique_lock<mutex> lock(some_mutex);
	// make copy of object on heap protected by mutex lock
	auto tmp_copy = foo[bar];
	// run lock.~unique_lock() because return
	// pass locally allocated object to copy constructor
	return tmp_copy;

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-22 22:00:27 -05:00
Zygo Blaxell
c58e5cd75b crucible: cache: construct return value before releasing lock
If we release the lock first (and C++ destructor order says we do), then
the return value will be constructed from data living in an unprotected
container object.  That data might be destroyed before we get to the
copy constructor for the return value.

Make a temporary copy of the return value that won't be destroyed by any
other thread, then unlock the mutex, then return the copy object.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-22 12:15:07 -05:00
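A minimal sketch of the pattern from c58e5cd75b, assuming a hypothetical `NameTable` class (none of these names come from crucible). The getter returns by value from a named local, so the copy is unambiguously taken while the lock is held. A getter that returned a reference into the container would hand the caller unprotected data as soon as the lock is released.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical mutex-protected table; illustrative only.
class NameTable {
public:
    void set(int key, const std::string &name) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_names[key] = name;
    }

    // Returns a copy, never a reference into m_names.
    std::string get(int key) {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Copy the element while the mutex is still held.
        auto tmp_copy = m_names[key];
        // The lock is released on return; the caller only sees tmp_copy.
        return tmp_copy;
    }

private:
    std::mutex m_mutex;
    std::map<int, std::string> m_names;
};
```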
Paul Jones
123d4e83c5 Remove reference to *.c files in Makefile
On Gentoo it errors out because there is no *.c


Signed-off-by: Paul Jones <paul@pauljones.id.au>
2017-01-22 16:49:50 +11:00
Zygo Blaxell
5de3b15daa src: Update bees-version.c more often
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-18 22:17:03 -05:00
Zygo Blaxell
38fffa8e27 lib: add a version string
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-18 22:17:02 -05:00
Zygo Blaxell
50417e961f crucible: rework the Resource class
Get rid of the ResourceHolder class.

Fix GCC static template member instantiation issues.

Replace assert() with exceptions.

shared_ptr can't seem to do reference counting in a multi-threaded
environment.  The code looks correct (for both ResourceHandle and
std::shared_ptr); however, continual segfaults don't lie.

Carpet-bomb with mutex locks to reduce the likelihood of losing shared_ptr
races.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-18 22:09:18 -05:00
Zygo Blaxell
6cc9b267ef crucible: time: fix uninitialized member
Found by valgrind.  It was mostly harmless because the range of
usable values is limited by m_burst (which was initialized) and 0.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-16 22:02:14 -05:00
Zygo Blaxell
9f120e326b bees: fix deadlock in thread status reporting
"s_name" was a thread_local variable, not static, and did not require a
mutex to protect access.  A deadlock is possible if a thread triggers an
exception with a handler that attempts to log a message (as the top-level
exception handler in bees does).

Remove multiple unnecessary mutex locks.  Rename the thread_local variables
to make their scope clearer.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-15 01:55:34 -05:00
Zygo Blaxell
382f8bf06a hash: prevent eleventy-gigabyte core dumps
Add MADV_DONTDUMP to the list of advice flags.

There are now three flags which may or may not be supported by the
target kernel.  Try each one and log its success or failure separately.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-12 22:55:08 -05:00
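The "try each flag, log each result" approach from 382f8bf06a might look like the sketch below. It assumes the three flags are `MADV_HUGEPAGE`, `MADV_DONTFORK`, and `MADV_DONTDUMP` (the first two are named in b5c01c1985, the third here). `advise_region` and `advise_demo` are hypothetical helpers, and the log format is invented for illustration. Linux only.

```cpp
#include <sys/mman.h>

#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Apply each advice flag independently, so one unsupported flag does not
// stop the others from being tried. Returns how many flags were accepted.
int advise_region(void *addr, std::size_t len) {
    struct Advice {
        int flag;
        const char *name;
    };
    const Advice advice[] = {
        { MADV_HUGEPAGE, "MADV_HUGEPAGE" },
        { MADV_DONTFORK, "MADV_DONTFORK" },
        { MADV_DONTDUMP, "MADV_DONTDUMP" },
    };
    int accepted = 0;
    for (const Advice &a : advice) {
        if (madvise(addr, len, a.flag) == 0) {
            std::fprintf(stderr, "madvise(%s) succeeded\n", a.name);
            ++accepted;
        } else {
            // Log and continue: none of these flags is a hard requirement.
            std::fprintf(stderr, "madvise(%s) failed: %s\n", a.name, std::strerror(errno));
        }
    }
    return accepted;
}

// Map an anonymous region, advise it, and unmap it again.
// Returns -1 if the mapping itself fails.
int advise_demo(std::size_t len) {
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    const int accepted = advise_region(p, len);
    munmap(p, len);
    return accepted;
}
```

How many flags succeed depends on the running kernel and its transparent hugepage configuration, so callers should treat the count as informational.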
Zygo Blaxell
5e91529ad2 hash: remove the unused m_prefetch_rate_limit
The hash table statistics calculation in BeesHashTable::prefetch_loop
and the data-driven operation of the extent scanner always pulls the
hash table into RAM as fast as the disk will push the data.  We never
use the prefetch rate limit, so remove it.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-11 21:15:12 -05:00
Zygo Blaxell
bddc07bd28 hash: make thread status message more consistent
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-11 21:15:12 -05:00
Zygo Blaxell
ffe2a767d3 crucible: extentwalker: add compressed() and bytenr() methods
Also use C++11 syntax for construction.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-11 21:15:11 -05:00
Zygo Blaxell
845267821c main: count arguments correctly
Replace one braindead mistake with another.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-10 01:10:38 -05:00
Zygo Blaxell
3138002a1f main: ArgList would silently drop the first argument
This fixes a bug where bees tries to process itself as a btrfs filesystem.
This is a species of bug that I only notice *after* pushing to a public
git repo.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-09 23:42:02 -05:00
Zygo Blaxell
4a57c5f499 README: update copyright year, remove some obsolete statements 2017-01-09 23:32:33 -05:00
Zygo Blaxell
bda4638048 crucible: LockSet: add a maximum size constraint
Extend the LockSet class so that the total number of locked (active)
items can be limited.  When the limit is reached, no new items can be
locked until some existing locked items are unlocked.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-09 23:23:51 -05:00
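The size constraint added in bda4638048 can be illustrated with a condition variable. This is a sketch under assumptions, not crucible's `LockSet`: the class name `BoundedLockSet`, the `int` item type, and the `try_lock` method are all hypothetical.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>

// Hypothetical bounded lock set; illustrative only.
class BoundedLockSet {
public:
    explicit BoundedLockSet(std::size_t max_size) : m_max_size(max_size) {}

    // Block until `item` is unlocked and the set has room, then lock it.
    void lock(int item) {
        std::unique_lock<std::mutex> guard(m_mutex);
        m_cond.wait(guard, [&] { return can_lock(item); });
        m_locked.insert(item);
    }

    // Non-blocking variant: returns false instead of waiting.
    bool try_lock(int item) {
        std::unique_lock<std::mutex> guard(m_mutex);
        if (!can_lock(item)) {
            return false;
        }
        m_locked.insert(item);
        return true;
    }

    // Unlock an item and wake waiters, since both conditions may have changed.
    void unlock(int item) {
        std::unique_lock<std::mutex> guard(m_mutex);
        m_locked.erase(item);
        m_cond.notify_all();
    }

    std::size_t size() {
        std::unique_lock<std::mutex> guard(m_mutex);
        return m_locked.size();
    }

private:
    // Caller must hold m_mutex.
    bool can_lock(int item) const {
        return m_locked.count(item) == 0 && m_locked.size() < m_max_size;
    }

    std::size_t m_max_size;
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::set<int> m_locked;
};
```

Waking all waiters on every unlock is the simplest correct choice here, because a waiter may be blocked on either the item itself or the size limit.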
Zygo Blaxell
fa8607bae0 crucible: get rid of DefaultBool, just use C++11 initializer syntax
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-09 23:23:32 -05:00
Zygo Blaxell
1b261b1ba7 build: move BEES_VERSION to a separate C file to avoid unnecessary building
Every git commit was causing bees.cc and bees-hash.cc to be rebuilt,
which was expensive and unnecessary.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-09 23:23:05 -05:00
Zygo Blaxell
6980935463 README: "btrfs: improve delayed refs iterations" has been merged into v4.10-rc1
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-09 00:05:18 -05:00
Zygo Blaxell
cf04fb17de crucible: remove unused execpipe
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-08 23:45:05 -05:00
Zygo Blaxell
4a9f26d12e crucible: remove ArgList and drop the unimplemented interpreter classes
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-08 23:45:05 -05:00
Zygo Blaxell
e8eaa7e471 trivial: mass purge of whitespace errors
Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2017-01-06 22:14:50 -05:00
Timofey Titovets
22e601912e Make filters configurable
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-12-30 04:26:42 +03:00
Timofey Titovets
badfa6e9b9 Add filter to remove time from bees output
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-12-29 17:04:47 +03:00
Timofey Titovets
03609f73db Add help section to Makefile
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-12-29 13:40:13 +03:00
Timofey Titovets
7f92f22dea Add install_scripts subcommand to make
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-12-29 13:30:08 +03:00
Timofey Titovets
f7c71a7a25 Add install subcommand to make
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-12-29 13:27:47 +03:00
Timofey Titovets
37713c2dd4 Scripts: Remove code for short path name in log
Commit "log: remove path from thread name" removed the path from the logs, so this code is useless.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-12-29 13:14:17 +03:00
Zygo Blaxell
65a950bc41 README.md: 32-bit hosts work now 2016-12-27 18:01:30 -05:00
Zygo Blaxell
ef8d92a3cb resolve: don't stop at the first physical address lookup failure
The btrfs LOGICAL_INO ioctl has no way to report references to compressed
blocks precisely, so we must always consider all references to a
compressed block, and discard those that do not have the desired offset.

When we encounter compressed shared extents containing a mix of unique
and duplicate data, we attempt to replace all references to the mixed
extent with the same number of references to multiple extents consisting
entirely of unique or duplicate blocks.  An early exit from the loop
in BeesResolver::for_each_extent_ref was stopping this operation early,
after replacing as few as one shared reference.  This left other shared
references to the unique data on the filesystem, effectively creating
new dup data.

The failing pattern looks like this:

    dedup: replace 0x14000..0x18000 from some other extent
    copy: 0x10000..0x14000
    dedup: replace 0x10000..0x14000 with the copy
    [may be multiple dedup lines due to multiple shared references]
    copy: 0x18000..0x1c000
    [missing dedup 0x18000..0x1c000 with the copy here]
    scan: 0x10000 [++++dddd++++] 0x1c000

If the extent 0x10000..0x1c000 is shared and compressed, we will make
a copy of the extent at 0x18000..0x1c000.  When we try to dedup this
copy extent, LOGICAL_INO will return a mix of references to the data
at logical 0x10000 and 0x18000 (which are both references to the
original shared extent with different offsets).  If we break out
of the loop too early, we will stop as soon as a reference to 0x10000
is found, and ignore all other references to the extent we are trying
to remove.

The copy at the beginning of the extent (0x10000..0x14000) usually
works because all references to the extent cover the entire extent.
When bees performs the dedup at 0x14000..0x18000, bees itself creates
the shared references with different offsets.

Uncompressed extents were not affected because LOGICAL_INO can locate
physical blocks precisely if they reside in uncompressed extents.

This change will hurt performance when looking up old physical addresses
that belong to new data, but that is a much less urgent problem.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2016-12-27 15:23:40 -05:00
Zygo Blaxell
6e7137f282 bees: work around btrfs fsync bug
btrfs provides a flush on rename when the rename target exists, so the
fsync is not necessary.  In the initialization case (when the rename
target does not exist and the implicit flush does not occur), the file
may be empty or a hole after a crash.  Bees treats this case the same
as if the file did not exist.  Since this condition occurs for only the
first 15 minutes of the lifetime of a bees installation, it's not worth
bothering to fix.

If we attempt to fsync the file ourselves, on a crash with log replay,
btrfs will end up with a directory entry pointing to a non-existent inode.
This directory entry cannot be deleted or renamed except by deleting
the entire subvol.  On large filesystems this bug is triggered by nearly
every crash (verified on kernels up to 4.5.7).

Remove the fsync to avoid the btrfs bug, and accept the failure mode
that occurs in the first 15 minutes after a bees install.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2016-12-27 15:20:31 -05:00
Zygo Blaxell
c1e31004b6 crawl: change scan order to make forward progress at all times
Previously, the scan order processed each subvol in order.  This required
very large amounts of temporary disk space, as a full filesystem scan
was required before any shared extents could be deduped.  If the hash
table RAM was underprovisioned this would mean some shared dup blocks
were removed from the hash table before they could be deduped.

Currently the scan order takes the first unscanned extent from each
subvol.  This works well if--and only if--the subvols are either empty
or children of a common ancestor.  It forces the same inode/offset pairs
to be read at close to the same time from each subvol.

When a new snapshot is created, this ordering diverts scanning to the
new subvol until it catches up to the existing subvols.  For large
filesystems with frequent snapshot creation this means that the scanner
never reaches the end of all subvols.  Each new subvol effectively
resets the current scan position for the entire filesystem to zero.
This prevents bees from ever completing the first filesystem scan.

Change the order again, so that we now read one unscanned extent from
each subvol in round-robin fashion.  When a new subvol is created, we
share scan time between old and new subvols.  This ensures we eventually
finish scanning initial subvols and enter the incremental scanning state.

The cost of this change is more repeated reading of shared extents at
scan time with less benefit from disk-device-level caching; however, the
only way to really fix this problem is to implement scanning on tree 2
(the btrfs extent tree) instead of the subvol trees.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2016-12-27 15:15:42 -05:00
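The round-robin order introduced in c1e31004b6 can be sketched as below. The `Subvol` struct, the integer extent ids, and `round_robin_scan` are hypothetical stand-ins for the real crawler state. The sketch only shows the scheduling idea: one unscanned extent per subvol per pass, so a newly created subvol shares scan time instead of monopolizing it.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical crawl state for one subvol.
struct Subvol {
    std::string name;
    std::deque<int> unscanned;  // extent ids still to be scanned, in order
};

// Take one unscanned extent from each subvol in turn until none remain.
// Returns the visit order as "name:extent" strings.
std::vector<std::string> round_robin_scan(std::vector<Subvol> subvols) {
    std::vector<std::string> order;
    bool progress = true;
    while (progress) {
        progress = false;
        for (Subvol &sv : subvols) {
            if (sv.unscanned.empty()) {
                continue;  // this subvol is finished; others still get their turn
            }
            order.push_back(sv.name + ":" + std::to_string(sv.unscanned.front()));
            sv.unscanned.pop_front();
            progress = true;
        }
    }
    return order;
}
```

With an old subvol holding extents 1..3 and a new snapshot holding extent 1, the order is `old:1`, `new:1`, `old:2`, `old:3`: the old subvol keeps advancing on every pass.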
Zygo Blaxell
7ecead1700 doc: comment updates
We stopped using FIEMAP for a number of reasons.  Document some of them.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2016-12-27 15:15:42 -05:00
Zygo Blaxell
efda609f66 log: remove path from thread name
The thread name has an arbitrarily limited size, and we are eventually
removing support for multiple paths in a single bees daemon process.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
2016-12-27 15:15:16 -05:00
Zygo Blaxell
abd696c524 build: add -D_FILE_OFFSET_BITS=64 to makeflags to build on 32-bit hosts
Also update the tests to insist that off_t be at least 64 bits wide.
2016-12-14 19:02:01 -05:00
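A compile-time check in the spirit of abd696c524 could look like this. It is a sketch, not the actual bees test: the `#define` stands in for the `-D_FILE_OFFSET_BITS=64` compiler flag, and `off_t_bits` is a hypothetical helper.

```cpp
// Normally supplied on the command line as -D_FILE_OFFSET_BITS=64.
// It must be defined before any system header is included.
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif

#include <sys/types.h>

#include <climits>

// Refuse to build if file offsets cannot address large files.
static_assert(sizeof(off_t) * CHAR_BIT >= 64, "off_t must be at least 64 bits wide");

// Width of off_t in bits, for runtime reporting.
constexpr int off_t_bits() {
    return static_cast<int>(sizeof(off_t) * CHAR_BIT);
}
```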
Zygo Blaxell
e835e8766e crucible: use set instead of vector in BtrfsExtentWalker
This gets rid of some more big memsets.  It may replace them
with a lot of tiny mallocs, though.  If this turns out to be
a bad idea then at least we can easily revert the change.
2016-12-13 21:46:41 -05:00
Zygo Blaxell
7782b79e4b crucible: reduce buffer size and CPU overhead for BtrfsIoctlSearchKey
We really do need some large buffers for BtrfsIoctlSearchKey in some
cases, but we don't need to zero them out first.  Don't do that so we
save some CPU.

Reduce the default buffer size to 4K because most BISK users don't
need much more than 1K.  Set the buffer size explicitly to the product of
the number of items and the desired item size in the places that really
need a lot of items.
2016-12-13 21:46:35 -05:00
Paul Jones
d7c065e17e Add native compiler optimizations to compiler flags
Signed-off-by: Paul Jones <paul@pauljones.id.au>
2016-12-13 12:53:29 +11:00
Paul Jones
334f5f83ee Remove unused crc64 function
Signed-off-by: Paul Jones <paul@pauljones.id.au>
2016-12-13 12:52:26 +11:00
Paul Jones
8abdeabddc Make crc64 go faster
The current crc64 algorithm is a variant of the Redis implementation.
Change it to a variant of the Adler implementation as described
at https://matt.sh/redis-crcspeed

Test program at https://github.com/PeeJay/crc64-compare
Filesize: 1.1G
Asking crc64-redis to sum "/media/peejay/BTRFS/1/ubuntu-14.04.5-desktop-amd64.iso"...
Asking crc64-adler to sum "/media/peejay/BTRFS/1/ubuntu-14.04.5-desktop-amd64.iso"...
Redis CRC-64: f971f9ac6c8ba458
Adler CRC-64: f971f9ac6c8ba458
Adler throughput: 1659.913308 MB/s
Redis throughput: 437.284661 MB/s
Adler is 3.79x faster than Redis

Signed-off-by: Paul Jones <paul@pauljones.id.au>
2016-12-13 12:41:10 +11:00
Zygo Blaxell
f5f4d69ba3 lib: In 2016, Ubuntu still insists on topologically sorted libraries while linking
This fixes builds on Ubuntu Server 16.04.

Fixes: https://github.com/Zygo/bees/issues/8
2016-12-11 19:53:32 -05:00
Zygo Blaxell
ec9d4a1d15 crucible: fs: use a much smaller default search buffer size
It turns out we never use a value for m_buf_size that isn't the default,
and we also never ask for more than a few thousand items; however,
we do spend a ton of time memsetting the huge buffer to zero.

I don't know what the ideal size is, but 16K is a far better guess
than 1MB.  Let's reduce it for some immediate CPU benefit, and determine
what the size should be later.

Reported at https://github.com/Zygo/bees/issues/11
2016-12-11 13:24:44 -05:00
Zygo Blaxell
77c11bb90f bees: add version string and put it in main() and stats file
Now that we have more than one bees release it's somewhat important
to know which one each bug report is for...
2016-12-08 23:55:59 -05:00
Zygo Blaxell
b5c01c1985 hash: don't throw an exception if MADV_HUGEPAGE fails
We don't _need_ transparent hugepages.  We like them because they can
be faster, but it's not a requirement, and some people will disable
transparent hugepages because they make non-Bees-like workloads slow.

Try to use MADV_HUGEPAGE, but if it fails, just log the error and
continue.

MADV_DONTFORK would be useful if we still fork()ed, but we don't currently
do that.  It's still a useful flag to have because a fork() with more
than 50% of RAM in mlocked pages would result in a kernel OOM crash.
I don't think it's possible to run Bees on a kernel that does not support
the MADV_DONTFORK flag, so don't bother checking for that flag separately.
2016-12-08 23:55:59 -05:00
Zygo Blaxell
d82909387d README: upgrade kernel requirement to 4.4.3 because of kernel bugs 2016-12-08 23:55:58 -05:00
Zygo Blaxell
1cd6263552 README: document impact of 7f8e406 ("btrfs: improve delayed refs iterations") 2016-12-08 23:55:57 -05:00
Zygo Blaxell
eec80944cd roots: add a counter for crawl_ms, open_root and open_root_ino
Linux kernel commit 7f8e406 ("btrfs: improve delayed refs iterations")
seems to dramatically improve LOGICAL_INO performance.  Hopefully this
commit will find its way into mainline Linux soon.

This means that most of the time in Bees is now spent on block reading
(50-75%); however, there is still a big gap between block read and
the sum of everything else we are measuring with the "*_ms" counters.
This gap is about 30% of the run time, so it would be good to find out
what's in the gap.

Add ms counters around the crawl and open calls to capture where we are
spending all the time.
2016-12-08 23:55:39 -05:00
Zygo Blaxell
5a4ff9a0b8 Merge remote-tracking branch 'nefelim4ag/master' 2016-12-02 00:35:51 -05:00
Zygo Blaxell
9506406cff README: BEESHOME is now relative, UUIDs removed, resizing, file contents 2016-12-02 00:32:32 -05:00
Zygo Blaxell
1c4af5ce5a main: update usage message
BEESHOME is downgraded from required to optional.

Don't document the deprecated shared hash table feature.
2016-12-02 00:32:32 -05:00
Zygo Blaxell
642581e89a hash: remove the experimental shared hash-table and shared mmap features
The experiments are over, and the results were not a success.

Having two filesystems cohabiting in the same hash table results in a
lot of false positives, each of which requires some heavy IO to resolve.

Using MAP_SHARED to share a beeshash.dat between processes results in
catastrophically bad performance.

These features were abandoned long ago, but some of the code--and even
worse, its documentation--still remains.

Bees wants a hash table false positive rate below 0.1%.  With a shared
hash table the FP rate is about the same as the dedup rate.  Typically
duplicate files on one filesystem are duplicate on many filesystems.

One or more of Linux VFS and the btrfs mmap(MAP_SHARED) implementation
produce extremely poor performance results.  A five-order-of-magnitude
speedup was achieved by implementing paging in userspace with worker
threads.  We no longer need the support code for the MAP_SHARED case.

It is still possible to run many BeesContexts in a single process,
but now the only thing contexts share is the FD cache.
2016-12-02 00:26:02 -05:00
Zygo Blaxell
fdfa78a81b context: default and relative BEESHOME
Allow relative paths with BEESHOME.  These paths will be relative
to the root of the dedup target filesystem.

BEESHOME is now optional.  If not specified, '.beeshome' is used.

We don't try to create BEESHOME if it doesn't exist.  BEESHOME might
not be on a btrfs filesystem, so we can't insist it be a subvol.
2016-12-02 00:22:18 -05:00
Zygo Blaxell
6fa8de660b hash: create beeshash.dat if it does not exist
BeesHashTable can now create a beeshash.dat if the file does not already
exist.  Currently the default size is one hash table extent (16MB) and
there's no way to change that (yet), so users should still create their
own hash tables for now.

The opening of the hash table is deferred (slightly) in preparation for
hash table resizing.

No doc as the feature is currently unfinished.
2016-12-02 00:20:30 -05:00
Zygo Blaxell
d58de9b76d bees: introduce BEESLOGNOTE macro
Quite often we have the same message in BEESLOG and BEESNOTE, so
make a macro to combine them.
2016-12-02 00:20:29 -05:00
Zygo Blaxell
ea0910ee6c crucible: fd: remove dead reference to unlink_or_die, introduce ftruncate_or_die 2016-12-02 00:19:37 -05:00
Zygo Blaxell
dd21e6f848 crucible: add missing template specializations of pwrite helper functions
I got a little too enthusiastic when redacting the code, and removed some
overloaded functions bees was using.  C++ silently found replacements,
and the result was a bug that prevented any data from being persisted
from the hash table.

Fixes: https://github.com/Zygo/bees/issues/7
2016-12-02 00:16:51 -05:00
Zygo Blaxell
06e111c229 crawl: remove UUID from file names
Unfortunately we don't get to remove the libuuid dependency because
we still want to read a file that exists in the legacy location.
2016-12-02 00:16:03 -05:00
Timofey Titovets
606d48acc1 Add option to make mnt path shorter in logs
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-11-28 08:23:50 +03:00
Timofey Titovets
bf4e31ae71 Add default values to vars
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-11-27 06:23:42 +03:00
Timofey Titovets
03c116c3f1 Add Systemd service for bash wrapper
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-11-27 03:19:31 +03:00
Timofey Titovets
a384cd976a Add bash wrapper
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
2016-11-27 03:19:19 +03:00
Zygo Blaxell
38bb70f5d0 build: OK, maybe 32-bit machines could work
I accidentally did a pre-push verification on a 32-bit build host.
There were a surprisingly small number of problems, so fix them.

Bees now builds on a 32-bit host.  Let's not update README just yet,
though:  the 32-bit ioctl support fails immediately after startup on a
64-bit kernel.
2016-11-26 02:06:28 -05:00
Zygo Blaxell
a57404442c execpipe: remove unreachable debug code
This is tripping up builds in stricter build environments.

https://github.com/Zygo/bees/issues/2
2016-11-26 01:06:44 -05:00
Zygo Blaxell
1e621cf4e7 README: Improve "about" section and update compiler dependency
"agent" is a nice generic term for the set of things that userspace
btrfs deduplicators are.  Let's call it that.

Throw out the awkward and rambling "About" text and use the announcement
from linux-btrfs instead.  Terrible English writing I at am.
2016-11-24 23:06:28 -05:00
Zygo Blaxell
1303fb9da8 build: fix FTBFS on GCC 6.2
I'm not surprised that GCC 6 doesn't let me send an ostream ref to itself,
even inside an uninstantiated template specialization.  I am a little
surprised I was trying to, and 4.9 let me get away with it.

It's 2016.  auto_ptr is deprecated now.

Some things were including vector that don't any more.

https://github.com/Zygo/bees/issues/1
2016-11-24 22:20:11 -05:00
Zygo Blaxell
876b76d761 README.md: answer some questions that came in after release 2016-11-17 15:13:47 -05:00
111 changed files with 13998 additions and 4251 deletions

6
.gitignore vendored

@@ -1,6 +1,8 @@
*.[ao]
*.bak
*.dep
*.new
*.tmp
*.so*
Doxyfile
README.html
@@ -10,3 +12,7 @@ html/
latex/
make.log
make.log.new
localconf
lib/configure.h
scripts/beesd
scripts/beesd@.service

9
Defines.mk Normal file

@@ -0,0 +1,9 @@
MAKE += PREFIX=$(PREFIX) LIBEXEC_PREFIX=$(LIBEXEC_PREFIX) ETC_PREFIX=$(ETC_PREFIX)
define TEMPLATE_COMPILER =
sed $< >$@ \
-e's#@DESTDIR@#$(DESTDIR)#' \
-e's#@PREFIX@#$(PREFIX)#' \
-e's#@ETC_PREFIX@#$(ETC_PREFIX)#' \
-e's#@LIBEXEC_PREFIX@#$(LIBEXEC_PREFIX)#'
endef


@@ -1,19 +1,71 @@
default install all: lib src test README.html
PREFIX ?= /usr
ETC_PREFIX ?= /etc
LIBDIR ?= lib
clean:
git clean -dfx
LIB_PREFIX ?= $(PREFIX)/$(LIBDIR)
LIBEXEC_PREFIX ?= $(LIB_PREFIX)/bees
.PHONY: lib src
SYSTEMD_SYSTEM_UNIT_DIR ?= $(shell pkg-config systemd --variable=systemdsystemunitdir)
lib:
$(MAKE) -C lib
BEES_VERSION ?= $(shell git describe --always --dirty || echo UNKNOWN)
# allow local configuration to override above variables
-include localconf
DEFAULT_MAKE_TARGET ?= reallyall
ifeq ($(DEFAULT_MAKE_TARGET),reallyall)
RUN_INSTALL_TESTS = test
endif
include Defines.mk
default: $(DEFAULT_MAKE_TARGET)
all: lib src scripts
reallyall: all doc test
clean: ## Cleanup
git clean -dfx -e localconf
.PHONY: lib src test doc
lib: ## Build libs
+$(MAKE) TAG="$(BEES_VERSION)" -C lib
src: ## Build bins
src: lib
$(MAKE) -C src
+$(MAKE) BEES_VERSION="$(BEES_VERSION)" -C src
test: ## Run tests
test: lib src
$(MAKE) -C test
+$(MAKE) -C test
README.html: README.md
markdown README.md > README.html.new
mv -f README.html.new README.html
doc: ## Build docs
+$(MAKE) -C docs
scripts/%: scripts/%.in
$(TEMPLATE_COMPILER)
scripts: scripts/beesd scripts/beesd@.service
install_bees: ## Install bees + libs
install_bees: src $(RUN_INSTALL_TESTS)
install -Dm755 bin/bees $(DESTDIR)$(LIBEXEC_PREFIX)/bees
install_scripts: ## Install scipts
install_scripts: scripts
install -Dm755 scripts/beesd $(DESTDIR)$(PREFIX)/sbin/beesd
install -Dm644 scripts/beesd.conf.sample $(DESTDIR)$(ETC_PREFIX)/bees/beesd.conf.sample
ifneq ($(SYSTEMD_SYSTEM_UNIT_DIR),)
install -Dm644 scripts/beesd@.service $(DESTDIR)$(SYSTEMD_SYSTEM_UNIT_DIR)/beesd@.service
endif
install: ## Install distribution
install: install_bees install_scripts
help: ## Show help
@fgrep -h "##" $(MAKEFILE_LIST) | fgrep -v fgrep | sed -e 's/\\$$//' | sed -e 's/##/\t/'
bees: reallyall
fly: install

411
README.md

@@ -1,360 +1,61 @@
BEES
====
Best-Effort Extent-Same, a btrfs deduplication daemon.
Best-Effort Extent-Same, a btrfs deduplication agent.
About Bees
About bees
----------
Bees is a daemon designed to run continuously on live file servers.
Bees consumes entire filesystems and deduplicates in a single pass, using
minimal RAM to store data. Bees maintains persistent state so it can be
interrupted and resumed, whether by planned upgrades or unplanned crashes.
Bees makes continuous incremental progress instead of using separate
scan and dedup phases. Bees uses the Linux kernel's `dedupe_file_range`
system call to ensure data is handled safely even if other applications
concurrently modify it.
Bees is intentionally btrfs-specific for performance and capability.
Bees uses the btrfs `SEARCH_V2` ioctl to scan for new data
without the overhead of repeatedly walking filesystem trees with the
POSIX API. Bees uses `LOGICAL_INO` and `INO_PATHS` to leverage btrfs's
existing metadata instead of building its own redundant data structures.
Bees can cope with Btrfs filesystem compression. Bees can reassemble
Btrfs extents to deduplicate extents that contain a mix of duplicate
and unique data blocks.
Bees includes a number of workarounds for Btrfs kernel bugs to (try to)
avoid ruining your day. You're welcome.
How Bees Works
--------------
Bees uses a fixed-size persistent dedup hash table with a variable dedup
block size. Any size of hash table can be dedicated to dedup. Bees will
scale the dedup block size to fit the filesystem's unique data size
using a weighted sampling algorithm. This allows Bees to adapt itself
to its filesystem size without forcing admins to do math at install time.
At the same time, the duplicate block alignment constraint can be as low
as 4K, allowing efficient deduplication of files with narrowly-aligned
duplicate block offsets (e.g. compiled binaries and VM/disk images).
The Bees hash table is loaded into RAM at startup (using hugepages if
available), mlocked, and synced to persistent storage by trickle-writing
over a period of several hours. This avoids issues related to seeking
or fragmentation, and enables the hash table to be efficiently stored
on Btrfs with compression (or an ext4 filesystem, or a raw disk, or
on CIFS...).
Once a duplicate block is identified, Bees examines the nearby blocks
in the files where the block appears. This allows Bees to find long runs
of adjacent duplicate block pairs if it has an entry for any one of
the blocks in its hash table. The stored hash entry plus the block
recently scanned from disk form a duplicate pair. On typical data sets,
this means most of the blocks in the hash table are redundant and can
be discarded without significant performance impact.
Hash table entries are grouped together into LRU lists. As each block
is scanned, its hash table entry is inserted into the LRU list at a
random position. If the LRU list is full, the entry at the end of the
list is deleted. If a hash table entry is used to discover duplicate
blocks, the entry is moved to the beginning of the list. This makes Bees
unable to detect a small number of duplicates (less than 1% on typical
filesystems), but it dramatically improves efficiency on filesystems
with many small files. Bees has found a net 13% more duplicate bytes
than a naive fixed-block-size algorithm with a 64K block size using the
same size of hash table, even after discarding 1% of the duplicate bytes.
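The LRU behaviour described above can be sketched as a small bucket class. This is an illustration under assumptions, not the bees hash table: `Bucket`, `insert`, and `hit` are hypothetical names, and the sketch drops the last entry *before* inserting, so a newly scanned block is never evicted by its own insertion (the text does not say which order bees uses).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical fixed-capacity LRU list of block hashes; illustrative only.
class Bucket {
public:
    Bucket(std::size_t capacity, unsigned seed) : m_capacity(capacity), m_rng(seed) {}

    // Insert a newly scanned block's hash at a random position.
    // If the list is full, the entry at the end is deleted first.
    void insert(std::uint64_t hash) {
        if (m_capacity == 0) {
            return;
        }
        if (m_entries.size() >= m_capacity) {
            m_entries.pop_back();
        }
        std::uniform_int_distribution<std::size_t> pick(0, m_entries.size());
        const auto pos = static_cast<std::ptrdiff_t>(pick(m_rng));
        m_entries.insert(m_entries.begin() + pos, hash);
    }

    // Look up a hash. On a hit, move the entry to the front of the list.
    bool hit(std::uint64_t hash) {
        auto it = std::find(m_entries.begin(), m_entries.end(), hash);
        if (it == m_entries.end()) {
            return false;
        }
        std::rotate(m_entries.begin(), it, it + 1);
        return true;
    }

    const std::vector<std::uint64_t> &entries() const { return m_entries; }

private:
    std::size_t m_capacity;
    std::mt19937 m_rng;
    std::vector<std::uint64_t> m_entries;
};
```

Entries that keep producing duplicate matches migrate to the front and survive, while entries that never match drift toward the end and are eventually dropped.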
Hash Table Sizing
-----------------
Hash table entries are 16 bytes each (64-bit hash, 52-bit block number,
and some metadata bits). Each entry represents a minimum of 4K on disk.
    unique data size    hash table size    average dedup block size
    1TB                 4GB                4K
    1TB                 1GB                16K
    1TB                 256MB              64K
    1TB                 16MB               1024K
    64TB                1GB                1024K
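The rows of the sizing table follow from the 16-byte entry size: the average dedup block size is the unique data size divided by the number of entries the table can hold. The sketch below reproduces that arithmetic, assuming binary units (1TB = 2^40 bytes) and using hypothetical names throughout.

```cpp
#include <algorithm>
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;
constexpr std::uint64_t GiB = 1024 * MiB;
constexpr std::uint64_t TiB = 1024 * GiB;

constexpr std::uint64_t ENTRY_SIZE = 16;      // bytes per hash table entry
constexpr std::uint64_t MIN_BLOCK = 4 * KiB;  // each entry covers at least 4K

// Average dedup block size for a given unique data size and table size.
std::uint64_t average_block_size(std::uint64_t unique_bytes, std::uint64_t table_bytes) {
    const std::uint64_t entries = table_bytes / ENTRY_SIZE;
    if (entries == 0) {
        return 0;  // table too small to hold a single entry
    }
    return std::max(MIN_BLOCK, unique_bytes / entries);
}
```

For example, a 1GB table holds 2^26 entries, so 1TB of unique data averages 2^40 / 2^26 = 16K per entry, matching the second row.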
Things You Might Expect That Bees Doesn't Have
----------------------------------------------
* There's no configuration file or getopt command line option processing
(patches welcome!). There are some tunables hardcoded in the source
that could eventually become configuration options.
* There's no way to *stop* the Bees daemon. Use SIGKILL, SIGTERM, or
Ctrl-C for now. Some of the destructors are unreachable and have never
been tested. Bees will repeat some work when restarted.
* The Bees process doesn't fork and writes its log to stdout/stderr.
A shell wrapper is required to make it behave more like a daemon.
* There's no facility to exclude any part of a filesystem (patches
welcome).
* PREALLOC extents and extents containing blocks filled with zeros will
be replaced by holes unconditionally.
* Duplicate block groups that are less than 12K in length can take 30%
of the run time while saving only 3% of the disk space. There should
be an option to just not bother with those.
* There is a lot of duplicate reading of blocks in snapshots. Bees will
scan all snapshots at close to the same time to try to get better
performance by caching, but really fixing this requires rewriting the
crawler to scan the btrfs extent tree directly instead of the subvol
FS trees.
* Bees had support for multiple worker threads in the past; however,
this was removed because it made Bees too aggressive to coexist with
other applications on the same machine. It also hit the *slow backrefs*
on N CPU cores instead of just one.
Good Btrfs Feature Interactions
-------------------------------
Bees has been tested in combination with the following:
* btrfs compression (either method), mixtures of compressed and uncompressed extents
* PREALLOC extents (unconditionally replaced with holes)
* HOLE extents and btrfs no-holes feature
* Other deduplicators, reflink copies (though Bees may decide to redo their work)
* btrfs snapshots and non-snapshot subvols (RW only)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, build daemons)
* all btrfs RAID profiles (people ask about this, but it's irrelevant)
* IO errors during dedup (read errors will throw exceptions, Bees will catch them and skip over the affected extent)
* Filesystems mounted *with* the flushoncommit option
* 4K filesystem data block size / clone alignment
* 64-bit CPUs (amd64)
* Large (>16M) extents
* Huge files (>1TB--although Btrfs performance on such files isn't great in general)
* filesystems up to 25T bytes, 100M+ files
Bad Btrfs Feature Interactions
------------------------------
Bees has not been tested with the following, and undesirable interactions may occur:
* Non-4K filesystem data block size (should work if recompiled)
* 32-bit CPUs (x86, arm)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (probably never will work)
* btrfs read-only snapshots (never tested, probably wouldn't work well)
* btrfs send/receive (receive is probably OK, but send requires RO snapshots. See above)
* btrfs qgroups (never tested, no idea what might happen)
* btrfs seed filesystems (does anyone even use those?)
* btrfs autodefrag mount option (never tested, could fight with Bees)
* btrfs nodatacow mount option or inode attribute (*could* work, but might not)
* btrfs out-of-tree kernel patches (e.g. in-band dedup or encryption)
* btrfs-convert from ext2/3/4 (never tested)
* btrfs mixed block groups (don't know a reason why it would *not* work, but never tested)
* open(O_DIRECT)
* Filesystems mounted *without* the flushoncommit option
Other Caveats
-------------
* btrfs balance will invalidate parts of the dedup table. Bees will
happily rebuild the table, but it will have to scan all the blocks
again.
* btrfs defrag will cause Bees to rescan the defragmented file. If it
contained duplicate blocks and other references to the original
fragmented duplicates still exist, Bees will replace the defragmented
extents with the original fragmented ones.
* Bees creates temporary files (with O_TMPFILE) and uses them to split
and combine extents elsewhere in btrfs. These will take up to 2GB
during normal operation.
* Like all deduplicators, Bees will replace data blocks with metadata
references. It is a good idea to ensure there are several GB of
unallocated space (see `btrfs fi df`) on the filesystem before running
Bees for the first time. Use
btrfs balance start -dusage=100,limit=1 /your/filesystem
If possible, raise the `limit` parameter to the current size of metadata
usage (from `btrfs fi df`) plus 1.
A Brief List Of Btrfs Kernel Bugs
---------------------------------
Fixed bugs:
* 3.13: `FILE_EXTENT_SAME` ioctl added. No way to reliably dedup with
concurrent modifications before this.
* 3.16: `SEARCH_V2` ioctl added. Bees could use `SEARCH` instead.
* 4.2: `FILE_EXTENT_SAME` no longer updates mtime, can be used at EOF.
Kernel deadlock bugs fixed.
* 4.7: *slow backref* bug no longer triggers a softlockup panic. It still
takes too long to resolve a block address to a root/inode/offset triple.
Unfixed kernel bugs (as of 4.5.7) with workarounds in Bees:
* *slow backref*: If the number of references to a single shared extent
within a single file grows above a few thousand, the kernel consumes CPU
for up to 40 uninterruptible minutes while holding various locks that
block access to the filesystem. Bees avoids this bug by measuring the
time the kernel spends performing certain operations and permanently
blacklisting any extent or hash where the kernel starts to get slow.
Inside Bees, such blocks are marked as 'toxic' hash/block addresses.
* `LOGICAL_INO` output is arbitrarily limited to 2730 references
even if more buffer space is provided for results. Once this number
has been reached, Bees can no longer replace the extent since it can't
find and remove all existing references. Bees refrains from adding
any more references after the first 2560. Offending blocks are
marked 'toxic' even if there is no corresponding performance problem.
This places an obvious limit on dedup efficiency for extremely common
blocks or filesystems with many snapshots (although this limit is
far greater than the effective limit imposed by the *slow backref* bug).
* `FILE_EXTENT_SAME` is arbitrarily limited to 16MB. This is less than
128MB which is the maximum extent size that can be created by defrag
or prealloc. Bees avoids feedback loops this can generate while
attempting to replace extents over 16MB in length.
* `DEFRAG_RANGE` is useless. The ioctl attempts to implement `btrfs
fi defrag` in the kernel, and will arbitrarily defragment more or
less than the range requested to match the behavior expected from the
userspace tool. Bees implements its own defrag instead, copying data
to a temporary file and using the `FILE_EXTENT_SAME` ioctl to replace
precisely the specified range of offending fragmented blocks.
* When writing BeesStringFile, a crash can cause the directory entry
`beescrawl.UUID.dat.tmp` to exist without a corresponding inode.
This directory entry cannot be renamed or removed; however, it does
not prevent the creation of a second directory entry with the same
name that functions normally, so it doesn't prevent Bees operation.
The orphan directory entry can be removed by deleting its subvol,
so place BEESHOME on a separate subvol so you can delete these orphan
directory entries when they occur (or use btrfs zero-log before mounting
the filesystem after a crash).
* If the fsync() in BeesTempFile::make_copy is removed, the filesystem
hangs within a few hours, requiring a reboot to recover.
Not really a bug, but a gotcha nonetheless:
* If a process holds a directory FD open, the subvol containing the
directory cannot be deleted (`btrfs sub del` will start the deletion
process, but it will not proceed past the first open directory FD).
`btrfs-cleaner` will simply skip over the directory *and all of its
children* until the FD is closed. Bees avoids this gotcha by closing
all of the FDs in its directory FD cache every 15 minutes.
Requirements
------------
* C++11 compiler (tested with GCC 4.9)
Sorry. I really like closures.
* btrfs-progs (tested with 4.1..4.7)
Needed for btrfs.h and ctree.h during compile.
Not needed at runtime.
* libuuid-dev
TODO: remove the one function used from this library.
It supports a feature Bees no longer implements.
* Linux kernel 4.2 or later
Don't bother trying to make Bees work with older kernels.
It won't end well.
* 64-bit host and target CPU
This code has never been tested on a 32-bit target CPU.
A 64-bit host CPU may be required for the self-tests.
Some of the ioctls don't work properly with a 64-bit
kernel and 32-bit userspace.
Build
-----
Build with `make`.
The build produces `bin/bees` and `lib/libcrucible.so`, which must be
copied to somewhere in `$PATH` and `$LD_LIBRARY_PATH` on the target
system respectively.
Setup
-----
Create a directory for bees state files:
export BEESHOME=/some/path
mkdir -p "$BEESHOME"
Create an empty hash table (your choice of size, but it must be a multiple
of 16M). This example creates a 1GB hash table:
truncate -s 1g "$BEESHOME/beeshash.dat"
chmod 700 "$BEESHOME/beeshash.dat"
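If you would rather compute the size than pick it by hand, a small helper can round a requested byte count up to the next multiple of 16M. This is only a sketch: `round_up_16m` is a made-up helper name, not something Bees provides.

```shell
# round_up_16m BYTES: print the smallest multiple of 16 MiB that is >= BYTES.
# Hypothetical helper for illustration; not part of Bees.
round_up_16m() {
    unit=$(( 16 * 1024 * 1024 ))
    echo $(( ($1 + unit - 1) / unit * unit ))
}

# Possible use (assumes BEESHOME is already set and the directory exists):
#   truncate -s "$(round_up_16m 1000000000)" "$BEESHOME/beeshash.dat"
```

The helper only does the arithmetic; creating the file and setting its permissions is still done with the commands shown above.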
Configuration
-------------
The only runtime configurable options are environment variables:
* BEESHOME: Directory containing Bees state files:
* beeshash.dat | persistent hash table (must be a multiple of 16M)
* beescrawl.`UUID`.dat | state of SEARCH_V2 crawlers
* beesstats.txt | statistics and performance counters
* BEESSTATUS: File containing a snapshot of current Bees state (performance
counters and current status of each thread).
Other options (e.g. interval between filesystem crawls) can be configured
in src/bees.h.
Running
-------
We created this directory in the previous section:
export BEESHOME=/some/path
Use a tmpfs for BEESSTATUS, because the file is updated once per second:
export BEESSTATUS=/run/bees.status
bees can only process the root subvol of a btrfs (seriously--if the
argument is not the root subvol directory, Bees will just throw an
exception and stop).
Use a bind mount, and let only bees access it:
mount -osubvol=/ /dev/<your-filesystem> /var/lib/bees/root
Reduce CPU and IO priority to be kinder to other applications
sharing this host (or raise them for more aggressive disk space
recovery). If you use cgroups, put bees in its own cgroup, then reduce
the `blkio.weight` and `cpu.shares` parameters. You can also use
`schedtool` and `ionice` in the shell script that launches bees:
schedtool -D -n20 $$
ionice -c3 -p $$
Let the bees fly:
bees /var/lib/bees/root >> /var/log/bees.log 2>&1
You'll probably want to arrange for /var/log/bees.log to be rotated
periodically. You may also want to set umask to 077 to prevent disclosure
of information about the contents of the filesystem through the log file.
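Putting the pieces of this section together, a launcher script might look roughly like the sketch below. The function name `launch_bees` and the `DRY_RUN` variable are invented for this example; the commands themselves are the ones shown above, and the paths are the same placeholders.

```shell
# launch_bees ROOT LOGFILE: lower this shell's priorities, then run bees.
# Hypothetical wrapper; set DRY_RUN=1 to print the commands instead of
# running them.
launch_bees() {
    run=""
    [ -n "${DRY_RUN:-}" ] && run=echo
    umask 077                      # keep the log file private
    $run schedtool -D -n20 $$      # lowest CPU priority for this shell
    $run ionice -c3 -p $$          # idle IO class for this shell
    if [ -n "$run" ]; then
        echo "bees $1 >> $2 2>&1"
    else
        bees "$1" >> "$2" 2>&1
    fi
}

# Possible use:
#   launch_bees /var/lib/bees/root /var/log/bees.log
```

Because the priorities are applied to the shell itself, bees inherits them when it starts.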
bees is a block-oriented userspace deduplication agent designed to scale
up to large btrfs filesystems. It is an offline dedupe combined with
an incremental data scan capability to minimize time data spends on disk
from write to dedupe.
Strengths
---------
* Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon mode - incrementally dedupes new data as it appears
* Largest extents first - recover more free space during fixed maintenance windows
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
* Persistent hash table for rapid restart after shutdown
* Constant hash table size - no increased RAM usage if data set becomes larger
* Works on live data - no scheduled downtime required
* Automatic self-throttling - reduces system load
* btrfs support - recovers more free space from btrfs than naive dedupers
Weaknesses
----------
* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
* Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
* [First run may increase metadata space usage if many snapshots exist](docs/gotchas.md)
* Constant hash table size - no decreased RAM usage if data set becomes smaller
* btrfs only
Installation and Usage
----------------------
* [Installation](docs/install.md)
* [Configuration](docs/config.md)
* [Running](docs/running.md)
* [Command Line Options](docs/options.md)
Recommended Reading
-------------------
* [bees Gotchas](docs/gotchas.md)
* [btrfs kernel bugs](docs/btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
* [bees vs. other btrfs features](docs/btrfs-other.md)
* [What to do when something goes wrong](docs/wrong.md)
More Information
----------------
* [How bees works](docs/how-it-works.md)
* [Missing bees features](docs/missing.md)
* [Event counter descriptions](docs/event-counters.md)
Bug Reports and Contributions
-----------------------------
@@ -363,13 +64,11 @@ Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.
You can also use Github:
https://github.com/Zygo/bees
https://github.com/Zygo/bees
Copyright & License
===================
-------------------
Copyright 2015-2016 Zygo Blaxell <bees@furryterror.org>.
Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).

1
docs/.gitignore vendored Normal file

@@ -0,0 +1 @@
*.html

18
docs/Makefile Normal file

@@ -0,0 +1,18 @@
MARKDOWN := $(firstword $(shell command -v cmark-gfm redcarpet markdown2 markdown markdown_py 2>/dev/null || echo markdown))
# If you have cmark-gfm, you get Github-style tables; otherwise, you don't.
ifeq ($(notdir $(MARKDOWN)),cmark-gfm)
MARKDOWN += -e table
endif
.PHONY: docs
docs: $(subst .md,.html,$(wildcard *.md)) index.html ../README.html
%.html: %.md Makefile
$(MARKDOWN) $< | sed -e 's/\.md/\.html/g' > $@.new
mv -f $@.new $@
index.md: ../README.md
sed -e 's:docs/::g' < ../README.md > index.md.new
mv -f index.md.new index.md

1
docs/_config.yml Normal file

@@ -0,0 +1 @@
theme: jekyll-theme-cayman

148
docs/btrfs-kernel.md Normal file

@@ -0,0 +1,148 @@
Recommended Linux Kernel Version for bees
=========================================
First, a warning about old Linux kernel versions:
> **Linux kernel versions 5.1, 5.2, and 5.3 should not be used with btrfs
due to a severe regression that can lead to fatal metadata corruption.**
This issue is fixed in version 5.4.14 and later.
**Recommended Linux kernel versions for bees are 5.4, 5.10, 5.15, 6.1,
6.6, or 6.12 with recent LTS and -stable updates.** The latest released
kernel as of this writing is 6.12.9, and the earliest supported LTS
kernel is 5.4.
Some optional bees features use kernel APIs introduced in kernel 4.15
(extent scan) and 5.6 (`openat2` support). These bees features are not
available on older kernels. Support for older kernels may be removed
in a future bees release.
bees will not run at all on kernels before 4.2 due to lack of minimal
API support.
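The version rules above can be turned into a quick check. The following is a minimal sketch, not something bees ships: `version_ge` and `bees_kernel_check` are hypothetical helper names, and the comparison relies on the `-V` (version sort) option of GNU `sort`. It only covers the 4.2 minimum and the 5.1 to 5.4.13 corruption window; distro kernels with backported fixes will be misjudged.

```shell
# version_ge A B: succeed if version A is greater than or equal to B.
# Assumes GNU sort with the -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# bees_kernel_check VERSION: print a one-word verdict for a kernel version.
# Hypothetical helper for illustration only.
bees_kernel_check() {
    v=${1%%[-+]*}                  # drop distro suffixes such as "-amd64"
    if ! version_ge "$v" 4.2; then
        echo too-old               # bees will not run at all
    elif version_ge "$v" 5.1 && ! version_ge "$v" 5.4.14; then
        echo avoid                 # metadata corruption regression
    else
        echo ok
    fi
}

bees_kernel_check "$(uname -r)"
```

A verdict of `ok` only means the kernel is outside the two ranges named above; the table below lists many other bugs that are worth checking against the running kernel.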
Kernel Bug Tracking Table
-------------------------
These bugs are particularly popular among bees users, though not all are specifically relevant to bees:
| First bad kernel | Last bad kernel | Issue Description | Fixed Kernel Versions | Fix Commit
| :---: | :---: | --- | :---: | ---
| - | 4.10 | garbage inserted in read data when reading compressed inline extent followed by a hole | 3.18.89, 4.1.49, 4.4.107, 4.9.71, 4.11 and later | e1699d2d7bf6 btrfs: add missing memset while reading compressed inline extents
| - | 4.14 | spurious warnings from `fs/btrfs/backref.c` in `find_parent_nodes` | 3.16.57, 4.14.29, 4.15.12, 4.16 and later | c8195a7b1ad5 btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes
| 4.15 | 4.18 | compression ratio and performance regression on bees test corpus | improved in 4.19 | 4.14 performance not fully restored yet
| - | 5.0 | silently corrupted data returned when reading compressed extents around a punched hole (bees dedupes all-zero data blocks with holes which can produce a similar effect to hole punching) | 3.16.70, 3.18.137, 4.4.177, 4.9.165, 4.14.108, 4.19.31, 5.0.4, 5.1 and later | 8e928218780e Btrfs: fix corruption reading shared and compressed extents after hole punching
| - | 5.0 | deadlock when dedupe and rename are used simultaneously on the same files | 5.0.4, 5.1 and later | 4ea748e1d2c9 Btrfs: fix deadlock between clone/dedupe and rename
| - | 5.1 | send failure or kernel crash while running send and dedupe on same snapshot at same time | 5.0.18, 5.1.4, 5.2 and later | 62d54f3a7fa2 Btrfs: fix race between send and deduplication that lead to failures and crashes
| - | 5.2 | alternating send and dedupe results in incremental send failure | 4.9.188, 4.14.137, 4.19.65, 5.2.7, 5.3 and later | b4f9a1a87a48 Btrfs: fix incremental send failure after deduplication
| 4.20 | 5.3 | balance convert to single rejected with error on 32-bit CPUs | 5.3.7, 5.4 and later | 7a54789074a5 btrfs: fix balance convert to single on 32-bit host CPUs
| - | 5.3 | kernel crash due to tree mod log issue #1 (often triggered by bees) | 3.16.79, 4.4.195, 4.9.195, 4.14.147, 4.19.77, 5.2.19, 5.3.4, 5.4 and later | efad8a853ad2 Btrfs: fix use-after-free when using the tree modification log
| - | 5.4 | kernel crash due to tree mod log issue #2 (often triggered by bees) | 3.16.83, 4.4.208, 4.9.208, 4.14.161, 4.19.92, 5.4.7, 5.5 and later | 6609fee8897a Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
| 5.1 | 5.4 | metadata corruption resulting in loss of filesystem when a write operation occurs while balance starts a new block group. **Do not use kernel 5.1 with btrfs.** Kernel 5.2 and 5.3 have workarounds that may detect corruption in progress and abort before it becomes permanent, but do not prevent corruption from occurring. Also kernel crash due to tree mod log issue #4. | 5.4.14, 5.5 and later | 6282675e6708 btrfs: relocation: fix reloc_root lifespan and access
| - | 5.4 | send performance failure when shared extents have too many references | 4.9.207, 4.14.159, 4.19.90, 5.3.17, 5.4.4, 5.5 and later | fd0ddbe25095 Btrfs: send, skip backreference walking for extents with many references
| 5.0 | 5.5 | dedupe fails to remove the last extent in a file if the file size is not a multiple of 4K | 5.4.19, 5.5.3, 5.6 and later | 831d2fa25ab8 Btrfs: make deduplication with range including the last block work
| 4.5, backported to 3.18.31, 4.1.22, 4.4.4 | 5.5 | `df` incorrectly reports 0 free space while data space is available. Triggered by changes in metadata size, including those typical of large-scale dedupe. Occurs more often starting in 5.3 and especially 5.4 | 4.4.213, 4.9.213, 4.14.170, 4.19.102, 5.4.18, 5.5.2, 5.6 and later | d55966c4279b btrfs: do not zero f_bavail if we have available space
| - | 5.5 | kernel crash due to tree mod log issue #3 (often triggered by bees) | 3.16.84, 4.4.214, 4.9.214, 4.14.171, 4.19.103, 5.4.19, 5.5.3, 5.6 and later | 7227ff4de55d Btrfs: fix race between adding and putting tree mod seq elements and nodes
| - | 5.6 | deadlock when enumerating file references to physical extent addresses while some references still exist in deleted subvols | 5.7 and later | 39dba8739c4e btrfs: do not resolve backrefs for roots that are being deleted
| - | 5.6 | deadlock when many extent reference updates are pending and available memory is low | 4.14.177, 4.19.116, 5.4.33, 5.5.18, 5.6.5, 5.7 and later | 351cbf6e4410 btrfs: use nofs allocations for running delayed items
| - | 5.6 | excessive CPU usage in `LOGICAL_INO` and `FIEMAP` ioctl and increased btrfs write latency in other processes when bees translates from extent physical address to list of referencing files and offsets. Also affects other tools like `duperemove` and `btrfs send` | 5.4.96, 5.7 and later | b25b0b871f20 btrfs: backref, use correct count to resolve normal data refs, plus 3 parent commits. Some improvements also in earlier kernels.
| - | 5.7 | filesystem becomes read-only if out of space while deleting snapshot | 4.9.238, 4.14.200, 4.19.149, 5.4.69, 5.8 and later | 7c09c03091ac btrfs: don't force read-only after error in drop snapshot
| 5.1 | 5.7 | balance, device delete, or filesystem shrink operations loop endlessly on a single block group without decreasing extent count | 5.4.54, 5.7.11, 5.8 and later | 1dae7e0e58b4 btrfs: reloc: clear DEAD\_RELOC\_TREE bit for orphan roots to prevent runaway balance
| - | 5.8 | deadlock in `TREE_SEARCH` ioctl (core component of bees filesystem scanner), followed by regression in deadlock fix | 4.4.237, 4.9.237, 4.14.199, 4.19.146, 5.4.66, 5.8.10 and later | a48b73eca4ce btrfs: fix potential deadlock in the search ioctl, 1c78544eaa46 btrfs: fix wrong address when faulting in pages in the search ioctl
| 5.7 | 5.10 | kernel crash if balance receives fatal signal e.g. Ctrl-C | 5.4.93, 5.10.11, 5.11 and later | 18d3bff411c8 btrfs: don't get an EINTR during drop_snapshot for reloc
| 5.10 | 5.10 | 20x write performance regression | 5.10.8, 5.11 and later | e076ab2a2ca7 btrfs: shrink delalloc pages instead of full inodes
| 5.4 | 5.11 | spurious tree checker failures on extent ref hash | 5.4.125, 5.10.43, 5.11.5, 5.12 and later | 1119a72e223f btrfs: tree-checker: do not error out if extent ref hash doesn't match
| - | 5.11 | tree mod log issue #5 | 4.4.263, 4.9.263, 4.14.227, 4.19.183, 5.4.108, 5.10.26, 5.11.9, 5.12 and later | dbcc7d57bffc btrfs: fix race when cloning extent buffer during rewind of an old root
| - | 5.12 | tree mod log issue #6 | 4.14.233, 4.19.191, 5.4.118, 5.10.36, 5.11.20, 5.12.3, 5.13 and later | f9690f426b21 btrfs: fix race when picking most recent mod log operation for an old root
| 5.11 | 5.12 | subvols marked for deletion with `btrfs sub del` become permanently undeletable ("ghost" subvols) | 5.12 stopped creation of new ghost subvols | Partially fixed in 8d488a8c7ba2 btrfs: fix subvolume/snapshot deletion not triggered on mount. Qu wrote a [patch](https://github.com/adam900710/linux/commit/9de990fcc8864c376eb28aa7482c54321f94acd4) to allow `btrfs sub del -i` to remove "ghost" subvols, but it was never merged upstream.
| 4.15 | 5.16 | spurious warnings from `fs/fs-writeback.c` when `flushoncommit` is enabled | 5.15.27, 5.16.13, 5.17 and later | a0f0cf8341e3 btrfs: get rid of warning on transaction commit when using flushoncommit
| - | 5.17 | crash during device removal can make filesystem unmountable | 5.15.54, 5.16.20, 5.17.3, 5.18 and later | bbac58698a55 btrfs: remove device item and update super block in the same transaction
| - | 5.18 | wrong superblock num_devices makes filesystem unmountable | 4.14.283, 4.19.247, 5.4.198, 5.10.121, 5.15.46, 5.17.14, 5.18.3, 5.19 and later | d201238ccd2f btrfs: repair super block num_devices automatically
| 5.18 | 5.19 | parent transid verify failed during log tree replay after a crash during a rename operation | 5.18.18, 5.19.2, 6.0 and later | 723df2bcc9e1 btrfs: join running log transaction when logging new name
| 5.12 | 6.0 | space cache corruption and potential double allocations | 5.15.65, 5.19.6, 6.0 and later | ced8ecf026fd btrfs: fix space cache corruption and potential double allocations
| 6.0 | 6.5 | suboptimal allocation in multi-device filesystems due to chunk allocator regression | 6.1.60, 6.5.9, 6.6 and later | 8a540e990d7d btrfs: fix stripe length calculation for non-zoned data chunk allocation
| 6.3, backported to 5.15.107, 6.1.24, 6.2.11 | 6.3 | vmalloc error, failed to allocate pages | 6.3.10, 6.4 and later. Bug (f349b15e183d "mm: vmalloc: avoid warn_alloc noise caused by fatal signal" in v6.3-rc6) backported to 6.1.24, 6.2.11, and 5.15.107. | 95a301eefa82 mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
| 6.2 | 6.3 | `IGNORE_OFFSET` flag ignored in `LOGICAL_INO` ioctl | 6.2.16, 6.3.3, 6.4 and later | 0cad8f14d70c btrfs: fix backref walking not returning all inode refs
| 6.10 | 6.11 | `adding refs to an existing tree ref`, `failed to run delayed ref`, then read-only | 6.11.10, 6.12 and later | 7d493a5ecc26 btrfs: fix incorrect comparison for delayed refs
| 5.4 | - | kernel hang when multiple threads are running `LOGICAL_INO` and dedupe/clone ioctl on the same extent | - | workaround: avoid doing that
"Last bad kernel" refers to that version's last stable update from
kernel.org. Distro kernels may backport additional fixes. Consult
your distro's kernel support for details.
When the same version appears in both "last bad kernel" and "fixed kernel
version" columns, it means the bug appears in the `.0` release and is
fixed in the stated `.y` release. e.g. a "last bad kernel" of 5.4 and
a "fixed kernel version" of 5.4.14 has the bug in kernel versions 5.4.0
through 5.4.13 inclusive.
A "-" for "first bad kernel" indicates the bug has been present since
the relevant feature first appeared in btrfs.
A "-" for "last bad kernel" indicates the bug has not yet been fixed in
current kernels (see top of this page for which kernel version that is).
In cases where issues are fixed by commits spread out over multiple
kernel versions, "fixed kernel version" refers to the version that
contains the last committed component of the fix.
Workarounds for known kernel bugs
---------------------------------
* **Hangs with concurrent `LOGICAL_INO` and dedupe/clone**: on all
kernel versions so far, multiple threads running `LOGICAL_INO` and
dedupe/clone ioctls at the same time on the same inodes or extents
can lead to a kernel hang. The kernel enters an infinite loop in
`add_all_parents`, where `count` is 0, `ref->count` is 1, and
`btrfs_next_item` or `btrfs_next_old_item` never find a matching ref.
bees has two workarounds for this bug: 1. schedule work so that multiple
threads do not simultaneously access the same inode or the same extent,
and 2. use a brute-force global lock within bees that prevents any
thread from running `LOGICAL_INO` while any other thread is running
dedupe.
Workaround #1 isn't really a workaround, since we want to do the same
thing for unrelated performance reasons. If multiple threads try to
perform dedupe operations on the same extent or inode, btrfs will make
all the threads wait for the same locks anyway, so it's better to have
bees find some other inode or extent to work on while waiting for btrfs
to finish.
Workaround #2 doesn't seem to be needed after implementing workaround
#1, but it's better to be slightly slower than to hang one CPU core
and the filesystem until the kernel is rebooted.
It is still theoretically possible to trigger the kernel bug when
running bees at the same time as other dedupers, or other programs
that use `LOGICAL_INO` like `btdu`, or when performing a reflink clone
operation such as `cp` or `mv`; however, it's extremely difficult to
reproduce the bug without closely cooperating threads.
* **Slow backrefs** (aka toxic extents): On older kernels, under certain
conditions, if the number of references to a single shared extent grows
too high, the kernel consumes more and more CPU while also holding
locks that delay write access to the filesystem. This is no longer
a concern on kernels after 5.7 (or an up-to-date 5.4 LTS version),
but there are still some remains of earlier workarounds for this issue
in bees that have not been fully removed.
bees avoided this bug by measuring the time the kernel spends performing
`LOGICAL_INO` operations and permanently blacklisting any extent or
hash involved where the kernel starts to get slow. In the bees log,
such blocks are labelled as 'toxic' hash/block addresses.
Future bees releases will remove toxic extent detection (it only detects
false positives now) and clear all previously saved toxic extent bits.
* **dedupe breaks `btrfs send` in old kernels**. The bees option
`--workaround-btrfs-send` prevents any modification of read-only subvols
in order to avoid breaking `btrfs send` on kernels before 5.2.
This workaround is no longer necessary to avoid kernel crashes and
send performance failure on kernel 5.4.4 and later. bees will pause
dedupe until the send is finished on current kernels.
`btrfs receive` is not and has never been affected by this issue.

44
docs/btrfs-other.md Normal file

@@ -0,0 +1,44 @@
Good Btrfs Feature Interactions
-------------------------------
bees has been tested in combination with the following:
* btrfs compression (zlib, lzo, zstd)
* PREALLOC extents (unconditionally replaced with holes)
* HOLE extents and btrfs no-holes feature
* Other deduplicators (`duperemove`, `jdupes`)
* Reflink copies (modern coreutils `cp` and `mv`)
* Concurrent file modification (e.g. PostgreSQL and sqlite databases, VMs, build daemons)
* All btrfs RAID profiles: single, dup, raid0, raid1, raid10, raid1c3, raid1c4, raid5, raid6
* IO errors during dedupe (affected extents are skipped)
* 4K filesystem data block size / clone alignment
* 64-bit and 32-bit LE host CPUs (amd64, x86, arm)
* Large files (kernel 5.4 or later strongly recommended)
* Filesystem data sizes up to 100T+ bytes, 1000M+ files
* `open(O_DIRECT)` (seems to work as well--or as poorly--with bees as with any other btrfs feature)
* btrfs-convert from ext2/3/4
* btrfs `autodefrag` mount option
* btrfs balance (data balances cause rescan of relocated data)
* btrfs block-group-tree
* btrfs `flushoncommit` and `noflushoncommit` mount options
* btrfs mixed block groups
* btrfs `nodatacow`/`nodatasum` inode attribute or mount option (bees skips all nodatasum files)
* btrfs qgroups and quota support (_not_ squotas)
* btrfs receive
* btrfs scrub
* btrfs send (dedupe pauses automatically, kernel 5.4 or later required)
* btrfs snapshot, non-snapshot subvols (RW and RO), snapshot delete
**Note:** some btrfs features have minimum kernel versions which are
higher than the minimum kernel version for bees.
Untested Btrfs Feature Interactions
-----------------------------------
bees has not been tested with the following, and undesirable interactions may occur:
* Non-4K filesystem data block size (should work if recompiled)
* Non-equal hash (SUM) and filesystem data block (CLONE) sizes (need to fix that eventually)
* btrfs seed filesystems, raid-stripe-tree, squotas (no particular reason these wouldn't work, but no one has reported trying)
* btrfs out-of-tree kernel patches (e.g. encryption, extent tree v2)
* Host CPUs with exotic page sizes, alignment requirements, or endianness (ppc, alpha, sparc, strongarm, s390, mips, m68k...)

317
docs/config.md Normal file

@@ -0,0 +1,317 @@
bees Configuration
==================
The only configuration parameter that *must* be provided is the hash
table size. Other parameters are optional or hardcoded, and the defaults
are reasonable in most cases.
Hash Table Sizing
-----------------
Hash table entries are 16 bytes per data block. The hash table stores the
most recently read unique hashes. Once the hash table is full, each new
entry added to the table evicts an old entry. This makes the hash table
a sliding window over the most recently scanned data from the filesystem.
Here are some numbers to estimate appropriate hash table sizes:
unique data size | hash table size | average dedupe extent size
1TB              | 4GB             | 4K
1TB              | 1GB             | 16K
1TB              | 256MB           | 64K
1TB              | 128MB           | 128K <- recommended
1TB              | 16MB            | 1024K
64TB             | 1GB             | 1024K
Notes:
* If the hash table is too large, no extra dedupe efficiency is
obtained, and the extra space wastes RAM.
* If the hash table is too small, bees extrapolates from matching
blocks to find matching adjacent blocks in the filesystem that have been
evicted from the hash table. In other words, bees only needs to find
one block in common between two extents in order to be able to dedupe
the entire extents. This provides significantly more dedupe hit rate
per hash table byte than other dedupe tools.
* There is a fairly wide range of usable hash sizes, and performance
degrades according to a smooth probabilistic curve in both directions.
Double or half the optimum size usually works just as well.
* When counting unique data in compressed data blocks to estimate
optimum hash table size, count the *uncompressed* size of the data.
* Another way to approach the hash table size is to simply decide how much
RAM can be spared without too much discomfort, give bees that amount of
RAM, and accept whatever dedupe hit rate occurs as a result. bees will
do the best job it can with the RAM it is given.
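The rows of the sizing table earlier in this section are consistent with a simple rule: one 16-byte entry for each chunk of unique data the size of the average dedupe extent. The sketch below just reproduces that arithmetic; `hash_table_bytes` is a hypothetical helper name, and the result is an estimate, not a value bees computes or enforces.

```shell
# hash_table_bytes UNIQUE_BYTES AVG_EXTENT_BYTES
# Estimate the hash table size: one 16-byte entry per average-sized extent.
# Hypothetical helper for illustration only.
hash_table_bytes() {
    echo $(( $1 / $2 * 16 ))
}

TiB=$(( 1024 * 1024 * 1024 * 1024 ))
KiB=1024

# 1TB of unique data with 128K average dedupe extents (the recommended row)
hash_table_bytes "$TiB" $(( 128 * KiB ))   # prints 134217728 (128MB)
```

Given the note above that double or half the optimum usually works just as well, the output is best treated as a starting point rather than an exact requirement.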
Factors affecting optimal hash table size
-----------------------------------------
It is difficult to predict the net effect of data layout and access
patterns on dedupe effectiveness without performing deep inspection of
both the filesystem data and its structure--a task that is as expensive
as performing the deduplication.
* **Compression** in files reduces the average extent length compared
to uncompressed files. The maximum compressed extent length on
btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
Longer extents decrease the optimum hash table size while shorter extents
increase the optimum hash table size, because the probability of a hash
table entry being present (i.e. unevicted) in each extent is proportional
to the extent length.
As a rule of thumb, the optimal hash table size for a compressed
filesystem is 2-4x larger than the optimal hash table size for the same
data on an uncompressed filesystem. Dedupe efficiency falls rapidly with
hash tables smaller than 128MB/TB as the average dedupe extent size is
larger than the largest possible compressed extent size (128KB).
* **Short writes or fragmentation** also shorten the average extent
length and increase optimum hash table size. If a database writes to
files randomly using 4K page writes, all of these extents will be 4K
in length, and the hash table size must be increased to retain each one
(or the user must accept a lower dedupe hit rate).
Defragmenting files that have had many short writes increases the
extent length and therefore reduces the optimum hash table size.
* **Time between duplicate writes** also affects the optimum hash table
size. bees reads data blocks in logical order during its first pass,
and after that new data blocks are read incrementally a few seconds or
minutes after they are written. bees finds more matching blocks if there
is a smaller amount of data between the matching reads, i.e. there are
fewer blocks evicted from the hash table. If most identical writes to
the filesystem occur near the same time, the optimum hash table size is
smaller. If most identical writes occur over longer intervals of time,
the optimum hash table size must be larger to avoid evicting hashes from
the table before matches are found.
For example, a build server normally writes out very similar source
code files over and over, so it will need a smaller hash table than a
backup server which has to refer to the oldest data on the filesystem
every time a new client machine's data is added to the server.
Scanning modes
--------------
The `--scan-mode` option affects how bees iterates over the filesystem,
schedules extents for scanning, and tracks progress.
There are now two kinds of scan mode: the legacy **subvol** scan modes,
and the new **extent** scan mode.
Scan mode can be changed by restarting bees with a different scan mode
option.
Extent scan mode:
* Works with 4.15 and later kernels.
* Can estimate progress and provide an ETA.
* Can optimize scanning order to dedupe large extents first.
* Can keep up with frequent creation and deletion of snapshots.
Subvol scan modes:
* Work with 4.14 and earlier kernels.
* Cannot estimate or report progress.
* Cannot optimize scanning order by extent size.
* Have problems keeping up with multiple snapshots created during a scan.
The default scan mode is 4, "extent".
If you are using bees for the first time on a filesystem with many
existing snapshots, you should read about [snapshot gotchas](gotchas.md).
Subvol scan modes
-----------------
Subvol scan modes are maintained for compatibility with existing
installations, but will not be developed further. New installations
should use extent scan mode instead.
The _quantity_ of text below detailing the shortcomings of each subvol
scan mode should be informative all by itself.
Subvol scan modes work on any kernel version supported by bees. They
are the only scan modes usable on kernel 4.14 and earlier.
The difference between the subvol scan modes is the order in which the
files from different subvols are fed into the scanner. They all scan
files in inode number order, from low to high offset within each inode,
the same way that a program like `cat` would read files (but skipping
over old data from earlier btrfs transactions).
If a filesystem has only one subvolume with data in it, then all of
the subvol scan modes are equivalent. In this case, there is only one
subvolume to scan, so every possible ordering of subvols is the same.
The `--workaround-btrfs-send` option pauses scanning subvols that are
read-only. If the subvol is made read-write (e.g. with `btrfs prop set
$subvol ro false`), or if the `--workaround-btrfs-send` option is removed,
then the scan of that subvol is unpaused and dedupe proceeds normally.
Space will only be recovered when the last read-only subvol is deleted.
Subvol scan modes cannot efficiently or accurately calculate an ETA for
completion or estimate progress through the data. They simply request
"the next new inode" from btrfs, and they are completed when btrfs says
there is no next new inode.
Between subvols, there are several scheduling algorithms with different
trade-offs:
Scan mode 0, "lockstep", scans the same inode number in each subvol at
close to the same time. This is useful if the subvols are snapshots
with a common ancestor, since the same inode number in each subvol will
have similar or identical contents. This maximizes the likelihood that
all of the references to a snapshot of a file are scanned at close to
the same time, improving dedupe hit rate. If the subvols are unrelated
(i.e. not snapshots of a single subvol) then this mode does not provide
any significant advantage. This mode uses smaller amounts of temporary
space for shorter periods of time when most subvols are snapshots. When a
new snapshot is created, this mode will stop scanning other subvols and
scan the new snapshot until the same inode number is reached in each
subvol, which will effectively stop dedupe temporarily as this data has
already been scanned and deduped in the other snapshots.
Scan mode 1, "independent", scans the next inode with new data in
each subvol. There is no coordination between the subvols, other than
round-robin distribution of files from each subvol to each worker thread.
This mode makes continuous forward progress in all subvols. When a new
snapshot is created, previous subvol scans continue as before, but the
worker threads are now divided among one more subvol.
Scan mode 2, "sequential", scans one subvol at a time, in numerical subvol
ID order, processing each subvol completely before proceeding to the next
subvol. This avoids spending time scanning short-lived snapshots that
will be deleted before they can be fully deduped (e.g. those used for
`btrfs send`). Scanning starts on older subvols that are more likely
to be origin subvols for future snapshots, eliminating the need to
dedupe future snapshots separately. This mode uses the largest amount
of temporary space for the longest time, and typically requires a larger
hash table to maintain dedupe hit rate.
Scan mode 3, "recent", scans the subvols with the highest `min_transid`
value first (i.e. the ones that were most recently completely scanned),
then falls back to "independent" mode to break ties. This interrupts
long scans of old subvols to give a rapid dedupe response to new data
in previously scanned subvols, then returns to the old subvols after
the new data is scanned.
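
The difference between the subvol scan modes can be pictured as a different choice of "which subvol advances next". The following is a simplified model written for this document, not the bees scheduler: the `SubvolCrawl` record and `pick_next` function are hypothetical, mode 1 is omitted because it advances every subvol in round-robin fashion rather than picking one, and ties in mode 3 are not broken the way bees does it.

```python
from dataclasses import dataclass

@dataclass
class SubvolCrawl:
    subvol_id: int    # btrfs subvol (root) ID
    next_inode: int   # next inode number this crawl will read
    min_transid: int  # lower transid bound of the crawl's current pass

def pick_next(crawls, mode):
    """Pick the crawl to advance next under a simplified model of each mode."""
    if mode == "lockstep":    # mode 0: keep inode numbers aligned across subvols
        return min(crawls, key=lambda c: (c.next_inode, c.subvol_id))
    if mode == "sequential":  # mode 2: finish the lowest subvol ID first
        return min(crawls, key=lambda c: c.subvol_id)
    if mode == "recent":      # mode 3: highest min_transid first
        return max(crawls, key=lambda c: c.min_transid)
    raise ValueError(f"mode not modelled in this sketch: {mode}")

crawls = [
    SubvolCrawl(subvol_id=256, next_inode=900, min_transid=10),
    SubvolCrawl(subvol_id=257, next_inode=300, min_transid=50),
    SubvolCrawl(subvol_id=258, next_inode=500, min_transid=70),
]
for mode in ("lockstep", "sequential", "recent"):
    print(mode, pick_next(crawls, mode).subvol_id)
```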
Extent scan mode
----------------
Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
Extent scan mode reads each extent once, regardless of the number of
reflinks or snapshots. It adapts to the creation of new snapshots
and reflinks immediately, without having to revisit old data.
In the extent scan mode, extents are separated into multiple size tiers
to prioritize large extents over small ones. Deduping large extents
keeps the metadata update cost low per block saved, resulting in faster
dedupe at the start of a scan cycle. This is important for maximizing
performance in use cases where bees runs for a limited time, such as
during an overnight maintenance window.
Once the larger size tiers are completed, dedupe space recovery speeds
slow down significantly. It may be desirable to stop bees running once
the larger size tiers are finished, then start bees running some time
later after new data has appeared.
Each extent is mapped in physical address order, and all extent references
are submitted to the scanner at the same time, resulting in much better
cache behavior and dedupe performance compared to the subvol scan modes.
The "extent" scan mode is not usable on kernels before 4.15 because
it relies on the `LOGICAL_INO_V2` ioctl added in that kernel release.
When using bees with an older kernel, only subvol scan modes will work.
Extents are divided into virtual subvols by size, using reserved btrfs
subvol IDs 250..255. The size tier groups are:
* 250: 32M+1 and larger
* 251: 8M+1..32M
* 252: 2M+1..8M
* 253: 512K+1..2M
* 254: 128K+1..512K
* 255: 128K and smaller (includes all compressed extents)
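
The tier boundaries listed above amount to a simple lookup from extent length to virtual subvol ID. A minimal sketch, assuming the length is given in bytes and that "K" and "M" mean KiB and MiB (the function and table names are hypothetical):

```python
KiB = 1024
MiB = 1024 * KiB

# (virtual subvol ID, largest extent length in bytes that belongs to the tier)
TIER_UPPER_BOUNDS = [
    (255, 128 * KiB),
    (254, 512 * KiB),
    (253, 2 * MiB),
    (252, 8 * MiB),
    (251, 32 * MiB),
]

def size_tier(extent_bytes):
    """Return the virtual subvol ID of the size tier an extent falls into."""
    for subvol_id, upper_bound in TIER_UPPER_BOUNDS:
        if extent_bytes <= upper_bound:
            return subvol_id
    return 250  # 32M+1 and larger

print(size_tier(4 * KiB), size_tier(128 * KiB + 1), size_tier(128 * MiB))
```

Compressed extents are at most 128KB long, which is why they all land in tier 255.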
Extent scan mode can efficiently calculate dedupe progress within
the filesystem and estimate an ETA for completion within each size
tier; however, the accuracy of the ETA can be questionable due to the
non-uniform distribution of block addresses in a typical user filesystem.
Older versions of bees do not recognize the virtual subvols, so running
an old bees version after running a new bees version will reset the
"extent" scan mode's progress in `beescrawl.dat` to the beginning.
This may change in future bees releases, i.e. extent scans will store
their checkpoint data somewhere else.
The `--workaround-btrfs-send` option behaves differently in extent
scan mode: dedupe proceeds on all subvols that are read-write, but all
subvols that are read-only are excluded from dedupe.
Space will only be recovered when the last read-only subvol is deleted.
During `btrfs send` all duplicate extents in the sent subvol will not be
removed (the kernel will reject dedupe commands while send is active,
and bees currently will not re-issue them after the send is complete).
It may be preferable to terminate the bees process while running `btrfs
send` in extent scan mode, and restart bees after the `send` is complete.
Threads and load management
---------------------------
By default, bees creates one worker thread for each CPU detected. These
threads then perform scanning and dedupe operations. bees attempts to
maximize the amount of productive work each thread does, until either the
threads are all continuously busy, or there is no remaining work to do.
In many cases it is not desirable to continually run bees at maximum
performance. Maximum performance is not necessary if bees can dedupe
new data faster than it appears on the filesystem. If it only takes
bees 10 minutes per day to dedupe all new data on a filesystem, then
bees doesn't need to run for more than 10 minutes per day.
bees supports a number of options for reducing system load:
* Run bees for a few hours per day, at an off-peak time (i.e. during
a maintenance window), instead of running bees continuously. Any data
added to the filesystem while bees is not running will be scanned when
bees restarts. At the end of the maintenance window, terminate the
bees process with SIGTERM to write the hash table and scan position
for the next maintenance window.
* Temporarily pause bees operation by sending the bees process SIGUSR1,
and resume operation with SIGUSR2. This is preferable to freezing
and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
signals, because it allows bees to close open file handles that would
otherwise prevent those files from being deleted while bees is frozen.
* Reduce the number of worker threads with the [`--thread-count` or
`--thread-factor` options](options.md). This simply leaves CPU cores
idle so that other applications on the host can use them, or to save
power.
* Allow bees to automatically track system load and increase or decrease
the number of threads to reach a target system load. This reduces
impact on the rest of the system by pausing bees when other CPU and IO
intensive loads are active on the system, and resumes bees when the other
loads are inactive. This is configured with the [`--loadavg-target`
and `--thread-min` options](options.md).
* Allow bees to self-throttle operations that enqueue delayed work
within btrfs. These operations are not well controlled by Linux
features such as process priority or IO priority or IO rate-limiting,
because the enqueued work is submitted to btrfs several seconds before
btrfs performs the work. By the time btrfs performs the work, it's too
late for external throttling to be effective. The [`--throttle-factor`
option](options.md) tracks how long it takes btrfs to complete queued
operations, and reduces bees's queued work submission rate to match
btrfs's queued work completion rate (or a fraction thereof, to reduce
system load).
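
The signal-based options above can be driven from a small wrapper script. This is a sketch only: `control_bees` and `BEES_SIGNALS` are hypothetical names, and finding the process ID of the running bees daemon (for example from the service manager) is left to the reader.

```python
import os
import signal

# Signals named in the list above and the effect each one has on bees.
BEES_SIGNALS = {
    "pause": signal.SIGUSR1,   # pause operation; bees can close open file handles
    "resume": signal.SIGUSR2,  # resume operation after a pause
    "stop": signal.SIGTERM,    # write hash table and scan position, then exit
}

def control_bees(pid, action, kill=os.kill):
    """Send the signal for `action` to the bees process with the given pid.

    `kill` is injectable so the function can be exercised without a
    running bees process.
    """
    kill(pid, BEES_SIGNALS[action])
```

A maintenance-window script could call `control_bees(pid, "pause")` before a heavy job, `control_bees(pid, "resume")` afterwards, and `control_bees(pid, "stop")` at the end of the window.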
Log verbosity
-------------
bees can be made less chatty with the [`--verbose` option](options.md).

docs/event-counters.md Normal file

@@ -0,0 +1,435 @@
Event Counters
==============
General
-------
Event counters are used in bees to collect simple branch-coverage
statistics. Every time bees makes a decision, it increments an event
counter, so there are _many_ event counters.
Events are grouped by prefix in their event names, e.g. `block` is block
I/O, `dedup` is deduplication requests, `tmp` is temporary files, etc.
Events with the suffix `_ms` count total milliseconds spent performing
the operation. These are counted separately for each thread, so there
can be more than 1000 ms per second.
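
Because `_ms` counters are summed over all threads, one way to read them is as an average number of threads busy with the operation. A small sketch with made-up sample values (the function name is hypothetical):

```python
def average_busy_threads(ms_before, ms_after, elapsed_seconds):
    """Average number of threads inside the operation during the interval."""
    return (ms_after - ms_before) / (elapsed_seconds * 1000.0)

# A counter that advanced by 25000 ms over 10 seconds of wall-clock time
# means that, on average, 2.5 threads were performing the operation.
print(average_busy_threads(10_000, 35_000, 10))
```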
There is considerable overlap between some events, e.g. `example_try`
denotes an event that is counted when an action is attempted,
`example_hit` is counted when the attempt succeeds and has a desired
outcome, and `example_miss` is counted when the attempt succeeds but
the desired outcome is not achieved. In most cases
`example_try = example_hit + example_miss + (attempts that failed and threw an exception)`,
but some event groups defy such simplistic equations.
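
The `_try` / `_hit` / `_miss` pattern can be sketched as follows. This is an illustration of the counting convention only, not the bees implementation; `counted_attempt` is a hypothetical helper and the counters are kept in a plain `collections.Counter`.

```python
from collections import Counter

def counted_attempt(counters, prefix, action):
    """Run action() and count it as <prefix>_try plus _hit or _miss."""
    counters[prefix + "_try"] += 1
    result = action()  # an exception here leaves only _try incremented
    if result:
        counters[prefix + "_hit"] += 1   # succeeded with the desired outcome
    else:
        counters[prefix + "_miss"] += 1  # succeeded without the desired outcome
    return result

counters = Counter()
counted_attempt(counters, "example", lambda: True)
counted_attempt(counters, "example", lambda: False)
try:
    counted_attempt(counters, "example", lambda: 1 / 0)
except ZeroDivisionError:
    pass

# Attempts that threw are the difference between _try and _hit + _miss.
failed = counters["example_try"] - counters["example_hit"] - counters["example_miss"]
print(dict(counters), failed)
```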
addr
----
The `addr` event group consists of operations related to translating `(root,
inode, offset)` tuples (i.e. logical position within a file) into btrfs
virtual block addresses (i.e. physical position on disk).
* `addr_block`: The address of a block was computed.
* `addr_compressed`: Obsolete implementation of `addr_compressed_offset`.
* `addr_compressed_offset`: The address of a compressed block was computed.
* `addr_delalloc`: The address of a block could not be computed due to
delayed allocation. Only possible when using obsolete `FIEMAP` code.
* `addr_eof_e`: The address of a block at EOF that was not block-aligned was computed.
* `addr_from_fd`: The address of a block was computed using a `fd`
(open to the file in question) and `offset` pair.
* `addr_from_root_fd`: The address of a block was computed using
the filesystem root `fd` instead of the open file `fd` for the
`TREE_SEARCH_V2` ioctl. This is obsolete and should probably be removed
at some point.
* `addr_hole`: The address of a block in a hole was computed.
* `addr_magic`: The address of a block cannot be determined in a way
that bees can use (unrecognized flags or flags known to be incompatible
with bees).
* `addr_uncompressed`: The address of an uncompressed block was computed.
* `addr_unrecognized`: The address of a block with unrecognized flags
(i.e. kernel version newer than bees) was computed.
* `addr_unusable`: The address of a block with unusable flags (i.e. flags
that are known to be incompatible with bees) was computed.
adjust
------
The `adjust` event group consists of operations related to translating stored virtual block addresses (i.e. physical position on disk) to `(root, inode, offset)` tuples (i.e. logical positions within files). `BeesResolver::adjust_offset` determines if a single candidate reference from the `LOGICAL_INO` ioctl corresponds to the requested btrfs virtual block address.
* `adjust_compressed_offset_correct`: A block address corresponding to a compressed block was retrieved from the hash table and resolved to a physical block containing data that matches another block bees has already read.
* `adjust_compressed_offset_wrong`: A block address corresponding to a compressed block was retrieved from the hash table and resolved to a physical block containing data that matches the hash but not the data from another block bees has already read (i.e. there was a hash collision).
* `adjust_eof_fail`: A block address corresponding to a block at EOF that was not aligned to a block boundary matched another block bees already read, but the length of the unaligned data in both blocks was not equal. This is usually caused by stale entries in the hash table pointing to blocks that have been overwritten since the hash table entries were created. It can also be caused by hash collisions, but hashes are not yet computed at this point in the code, so this event does not correlate to the `hash_collision` counter.
* `adjust_eof_haystack`: A block address from the hash table corresponding to a block at EOF that was not aligned to a block boundary was processed.
* `adjust_eof_hit`: A block address corresponding to a block at EOF that was not aligned to a block boundary matched a similarly unaligned block that bees already read.
* `adjust_eof_miss`: A block address from the hash table corresponding to a block at EOF that was not aligned to a block boundary did not match a similarly unaligned block that bees already read.
* `adjust_eof_needle`: A block address from scanning the disk corresponding to a block at EOF that was not aligned to a block boundary was processed.
* `adjust_exact`: A block address from the hash table corresponding to an uncompressed data block was processed to find its `(root, inode, offset)` references.
* `adjust_exact_correct`: A block address corresponding to an uncompressed block was retrieved from the hash table and resolved to a physical block containing data that matches another block bees has already read.
* `adjust_exact_wrong`: A block address corresponding to an uncompressed block was retrieved from the hash table and resolved to a physical block containing data that matches the hash but not the data from another block bees has already read (i.e. there was a hash collision).
* `adjust_hit`: A block address was retrieved from the hash table and resolved to a physical block in an uncompressed extent containing data that matches the data from another block bees has already read (i.e. a duplicate match was found).
* `adjust_miss`: A block address was retrieved from the hash table and resolved to a physical block containing a hash that does not match the hash from another block bees has already read (i.e. the hash table contained a stale entry and the data it referred to has since been overwritten in the filesystem).
* `adjust_needle_too_long`: A block address was retrieved from the hash table, but when the corresponding extent item was retrieved, its offset or length was out of range to be a match (i.e. the hash table contained a stale entry and the data it referred to has since been overwritten in the filesystem).
* `adjust_no_match`: A hash collision occurred (i.e. a block on disk was located with the same hash as the hash table entry but different data). Effectively an alias for `hash_collision` as it is not possible to have one event without the other.
* `adjust_offset_high`: The `LOGICAL_INO` ioctl gave an extent item that does not overlap with the desired block because the extent item ends before the desired block in the extent data.
* `adjust_offset_hit`: A block address was retrieved from the hash table and resolved to a physical block in a compressed extent containing data that matches the data from another block bees has already read (i.e. a duplicate match was found).
* `adjust_offset_low`: The `LOGICAL_INO` ioctl gave an extent item that does not overlap with the desired block because the extent item begins after the desired block in the extent data.
* `adjust_try`: A block address and extent item candidate were passed to `BeesResolver::adjust_offset` for processing.
block
-----
The `block` event group consists of operations related to reading data blocks from the filesystem.
* `block_bytes`: Number of data bytes read.
* `block_hash`: Number of block hashes computed.
* `block_ms`: Total time reading data blocks.
* `block_read`: Number of data blocks read.
* `block_zero`: Number of data blocks read with zero contents (i.e. candidates for replacement with a hole).
bug
---
The `bug` event group consists of known bugs in bees.
* `bug_bad_max_transid`: A bad `max_transid` was found and removed in `beescrawl.dat`.
* `bug_bad_min_transid`: A bad `min_transid` was found and removed in `beescrawl.dat`.
* `bug_dedup_same_physical`: `BeesContext::dedup` detected that the physical extent was the same for `src` and `dst`. This has no effect on space usage so it is a waste of time, and also carries the risk of creating a toxic extent.
* `bug_grow_pair_overlaps`: Two identical blocks were found, and while searching matching adjacent extents, the potential `src` grew to overlap the potential `dst`. This would create a cycle where bees keeps trying to eliminate blocks but instead just moves them around.
* `bug_hash_duplicate_cell`: Two entries in the hash table were identical. This only happens due to data corruption or a bug.
* `bug_hash_magic_addr`: An entry in the hash table contains an address with magic. Magic addresses cannot be deduplicated so they should not be stored in the hash table.
chase
-----
The `chase` event group consists of operations connecting btrfs virtual block addresses with `(root, inode, offset)` tuples. `resolve` is the top level, `adjust` is the bottom level, and `chase` is the middle level. `BeesResolver::chase_extent_ref` iterates over `(root, inode, offset)` tuples from `LOGICAL_INO` and attempts to find a single matching block in the filesystem given a candidate block from an earlier `scan` operation.
* `chase_corrected`: A matching block was resolved to a `(root, inode, offset)` tuple, but the offset of a block matching data did not match the offset given by `LOGICAL_INO`.
* `chase_hit`: A block address was successfully and correctly translated to a `(root, inode, offset)` tuple.
* `chase_no_data`: A block address was not successfully translated to a `(root, inode, offset)` tuple.
* `chase_no_fd`: A `(root, inode)` tuple could not be opened (i.e. the file was deleted on the filesystem).
* `chase_try`: A block address translation attempt started.
* `chase_uncorrected`: A matching block was resolved to a `(root, inode, offset)` tuple, and the offset of a block matching data did match the offset given by `LOGICAL_INO`.
* `chase_wrong_addr`: The btrfs virtual address (i.e. physical block address) found at a candidate `(root, inode, offset)` tuple did not match the expected btrfs virtual address (i.e. the filesystem was modified during the resolve operation).
* `chase_wrong_magic`: The extent item at a candidate `(root, inode, offset)` tuple has magic bits and cannot match any btrfs virtual address in the hash table (i.e. the filesystem was modified during the resolve operation).
crawl
-----
The `crawl` event group consists of operations related to scanning btrfs trees to find new extent refs to scan for dedupe.
* `crawl_again`: An inode crawl was restarted because the extent was already locked by another running crawl.
* `crawl_blacklisted`: An extent was not scanned because it belongs to a blacklisted file.
* `crawl_deferred_inode`: Two tasks attempted to scan the same inode at the same time, so one was deferred.
* `crawl_done`: One pass over a subvol was completed.
* `crawl_discard_high`: An extent that was too large for the crawler's size tier was discarded.
* `crawl_discard_low`: An extent that was too small for the crawler's size tier was discarded.
* `crawl_empty`: A `TREE_SEARCH_V2` ioctl call failed or returned an empty set (usually because all data in the subvol was scanned).
* `crawl_extent`: The extent crawler queued all references to an extent for processing.
* `crawl_fail`: A `TREE_SEARCH_V2` ioctl call failed.
* `crawl_flop`: Small extent items were not skipped because the next extent started at or before the end of the previous extent.
* `crawl_gen_high`: An extent item in the search results refers to an extent that is newer than the current crawl's `max_transid` allows.
* `crawl_gen_low`: An extent item in the search results refers to an extent that is older than the current crawl's `min_transid` allows.
* `crawl_hole`: An extent item in the search results refers to a hole.
* `crawl_inline`: An extent item in the search results contains an inline extent.
* `crawl_items`: An item in the `TREE_SEARCH_V2` data was processed.
* `crawl_ms`: Time spent running the `TREE_SEARCH_V2` ioctl.
* `crawl_no_empty`: Attempted to delete the last crawler. Should never happen.
* `crawl_nondata`: An item in the search results is not data.
* `crawl_prealloc`: An extent item in the search results refers to a `PREALLOC` extent.
* `crawl_push`: An extent item in the search results is suitable for scanning and deduplication.
* `crawl_scan`: An extent item in the search results is submitted to `BeesContext::scan_forward` for scanning and deduplication.
* `crawl_skip`: Small extent items were skipped because no extent of sufficient size was found within the minimum search distance.
* `crawl_skip_ms`: Time spent skipping small extent items.
* `crawl_search`: A `TREE_SEARCH_V2` ioctl call was successful.
* `crawl_throttled`: Extent scan created too many work queue items and was prevented from creating any more.
* `crawl_tree_block`: Extent scan found and skipped a metadata tree block.
* `crawl_unknown`: An extent item in the search results has an unrecognized type.
* `crawl_unthrottled`: Extent scan allowed to create work queue items again.
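
The `crawl_gen_low` and `crawl_gen_high` counters above reflect a filter on the generation (transid) of each extent item. A minimal sketch of that filter, assuming both bounds are inclusive (an assumption; the exact boundary handling is not stated here) and using a hypothetical function name:

```python
def classify_generation(extent_gen, min_transid, max_transid):
    """Name the counter an extent item would hit, or None if it is in range."""
    if extent_gen < min_transid:
        return "crawl_gen_low"   # already covered by an earlier pass
    if extent_gen > max_transid:
        return "crawl_gen_high"  # left for a later pass
    return None                  # within the current crawl's window

for gen in (90, 150, 250):
    print(gen, classify_generation(gen, min_transid=100, max_transid=200))
```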
dedup
-----
The `dedup` (sic) event group consists of operations that deduplicate data.
* `dedup_bytes`: Total bytes in extent references deduplicated.
* `dedup_copy`: Total bytes copied to eliminate unique data in extents containing a mix of unique and duplicate data.
* `dedup_hit`: Total number of pairs of identical extent references.
* `dedup_miss`: Total number of pairs of non-identical extent references.
* `dedup_ms`: Total time spent running the `FILE_EXTENT_SAME` (aka `FI_DEDUPERANGE` or `dedupe_file_range`) ioctl.
* `dedup_prealloc_bytes`: Total bytes in eliminated `PREALLOC` extent references.
* `dedup_prealloc_hit`: Total number of successfully eliminated `PREALLOC` extent references.
* `dedup_prealloc_miss`: Total number of `PREALLOC` extent references that could not be eliminated (i.e. filesystem data changed between scan and dedupe).
* `dedup_try`: Total number of pairs of extent references submitted for deduplication.
* `dedup_workaround_btrfs_send`: Total number of extent reference pairs submitted for deduplication that were discarded to work around `btrfs send` bugs.
exception
---------
The `exception` event group consists of C++ exceptions. C++ exceptions are thrown due to IO errors and internal constraint check failures.
* `exception_caught`: Total number of C++ exceptions thrown and caught by a generic exception handler.
* `exception_caught_silent`: Total number of "silent" C++ exceptions thrown and caught by a generic exception handler. These are exceptions which are part of the correct and normal operation of bees. The exceptions are logged at a lower log level.
extent
------
The `extent` event group consists of events that occur within the extent scanner.
* `extent_deferred_inode`: A lock conflict was detected when two worker threads attempted to manipulate the same inode at the same time.
* `extent_empty`: A complete list of references to an extent was created but the list was empty, e.g. because all refs are in deleted inodes or snapshots.
* `extent_fail`: An ioctl call to `LOGICAL_INO` failed.
* `extent_forward`: An extent reference was submitted for scanning.
* `extent_mapped`: A complete map of references to an extent was created and added to the crawl queue.
* `extent_ok`: An ioctl call to `LOGICAL_INO` completed successfully.
* `extent_overflow`: A complete map of references to an extent exceeded `BEES_MAX_EXTENT_REF_COUNT`, so the extent was dropped.
* `extent_ref_missing`: An extent reference reported by `LOGICAL_INO` was not found by later `TREE_SEARCH_V2` calls.
* `extent_ref_ok`: One extent reference was queued for scanning.
* `extent_restart`: An extent reference was requeued to be scanned again after an active extent lock is released.
* `extent_retry`: An extent reference was requeued to be scanned again after an active inode lock is released.
* `extent_skip`: A 4K extent with more than 1000 refs was skipped.
* `extent_zero`: An ioctl call to `LOGICAL_INO` succeeded, but reported an empty list of extents.
hash
----
The `hash` event group consists of operations related to the bees hash table.
* `hash_already`: A `(hash, address)` pair was already present in the hash table during a `BeesHashTable::push_random_hash_addr` operation.
* `hash_bump`: An existing `(hash, address)` pair was moved forward in the hash table by a `BeesHashTable::push_random_hash_addr` operation.
* `hash_collision`: A pair of data blocks was found with identical hashes but different data.
* `hash_erase`: A `(hash, address)` pair in the hash table was removed because a matching data block could not be found in the filesystem (i.e. the hash table entry is out of date).
* `hash_erase_miss`: A `(hash, address)` pair was reported missing from the filesystem but no such entry was found in the hash table (i.e. race between scanning threads or pair already evicted).
* `hash_evict`: A `(hash, address)` pair was evicted from the hash table to accommodate a new hash table entry.
* `hash_extent_in`: A hash table extent was read.
* `hash_extent_out`: A hash table extent was written.
* `hash_front`: A `(hash, address)` pair was pushed to the front of the list because it matched a duplicate block.
* `hash_front_already`: A `(hash, address)` pair was pushed to the front of the list because it matched a duplicate block, but the pair was already at the front of the list so no change occurred.
* `hash_insert`: A `(hash, address)` pair was inserted by `BeesHashTable::push_random_hash_addr`.
* `hash_lookup`: The hash table was searched for `(hash, address)` pairs matching a given `hash`.
open
----
The `open` event group consists of operations related to translating `(root, inode)` tuples into open file descriptors (i.e. `open_by_handle` emulation for btrfs).
* `open_clear`: The open FD cache was cleared to avoid keeping file descriptors open too long.
* `open_fail_enoent`: A file could not be opened because it no longer exists (i.e. it was deleted or renamed during the lookup/resolve operations).
* `open_fail_error`: A file could not be opened for other reasons (e.g. IO error, permission denied, out of resources).
* `open_file`: A file was successfully opened. This counts only the `open()` system call, not other reasons why the opened FD might not be usable.
* `open_hit`: A file was successfully opened and the FD was acceptable.
* `open_ino_ms`: Total time spent executing the `open()` system call.
* `open_lookup_empty`: No paths were found for the inode in the `INO_PATHS` ioctl.
* `open_lookup_enoent`: The `INO_PATHS` ioctl returned ENOENT.
* `open_lookup_error`: The `INO_PATHS` ioctl returned a different error.
* `open_lookup_ok`: The `INO_PATHS` ioctl successfully returned a list of one or more filenames.
* `open_no_path`: All attempts to open a file by `(root, inode)` pair failed.
* `open_no_root`: An attempt to open a file by `(root, inode)` pair failed because the `root` could not be opened.
* `open_root_ms`: Total time spent opening subvol root FDs.
* `open_wrong_dev`: A FD returned by `open()` did not match the device belonging to the filesystem subvol.
* `open_wrong_flags`: A FD returned by `open()` had incompatible flags (`NODATASUM` / `NODATACOW`).
* `open_wrong_ino`: A FD returned by `open()` did not match the expected inode (i.e. the file was renamed or replaced during the lookup/resolve operations).
* `open_wrong_root`: A FD returned by `open()` did not match the expected subvol ID (i.e. `root`).
pairbackward
------------
The `pairbackward` event group consists of events related to extending matching block ranges backward starting from the initial block match found using the hash table.
* `pairbackward_bof_first`: A matching pair of block ranges could not be extended backward because the beginning of the first (src) file was reached.
* `pairbackward_bof_second`: A matching pair of block ranges could not be extended backward because the beginning of the second (dst) file was reached.
* `pairbackward_hit`: A pair of matching block ranges was extended backward by one block.
* `pairbackward_miss`: A pair of matching block ranges could not be extended backward by one block because the pair of blocks before the first block in the range did not contain identical data.
* `pairbackward_ms`: Total time spent extending matching block ranges backward from the first matching block found by hash table lookup.
* `pairbackward_overlap`: A pair of matching block ranges could not be extended backward by one block because this would cause the two block ranges to overlap.
* `pairbackward_same`: A pair of matching block ranges could not be extended backward by one block because this would cause the two block ranges to refer to the same btrfs data extent.
* `pairbackward_stop`: Stopped extending a pair of matching block ranges backward for any of the reasons listed here.
* `pairbackward_toxic_addr`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic address.
* `pairbackward_toxic_hash`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic hash.
* `pairbackward_try`: Started extending a pair of matching block ranges backward.
* `pairbackward_zero`: A pair of matching block ranges could not be extended backward by one block because the src block contained all zeros and was not compressed.
pairforward
-----------
The `pairforward` event group consists of events related to extending matching block ranges forward starting from the initial block match found using the hash table.
* `pairforward_eof_first`: A matching pair of block ranges could not be extended forward because the end of the first (src) file was reached.
* `pairforward_eof_malign`: A matching pair of block ranges could not be extended forward because the end of the second (dst) file was not aligned to a 4K boundary nor the end of the first (src) file.
* `pairforward_eof_second`: A matching pair of block ranges could not be extended forward because the end of the second (dst) file was reached.
* `pairforward_hit`: A pair of matching block ranges was extended forward by one block.
* `pairforward_hole`: A pair of matching block ranges was extended forward by one block, and the block was a hole in the second (dst) file.
* `pairforward_miss`: A pair of matching block ranges could not be extended forward by one block because the pair of blocks after the last block in the range did not contain identical data.
* `pairforward_ms`: Total time spent extending matching block ranges forward from the first matching block found by hash table lookup.
* `pairforward_overlap`: A pair of matching block ranges could not be extended forward by one block because this would cause the two block ranges to overlap.
* `pairforward_same`: A pair of matching block ranges could not be extended forward by one block because this would cause the two block ranges to refer to the same btrfs data extent.
* `pairforward_stop`: Stopped extending a pair of matching block ranges forward for any of the reasons listed here.
* `pairforward_toxic_addr`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic address.
* `pairforward_toxic_hash`: A pair of matching block ranges was abandoned because the extended range would include a data block with a toxic hash.
* `pairforward_try`: Started extending a pair of matching block ranges forward.
* `pairforward_zero`: A pair of matching block ranges could not be extended forward by one block because the src block contained all zeros and was not compressed.
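The backward and forward extension described by these two event groups can be sketched in Python as follows. This is a simplified illustration, not the bees implementation: the function name `extend_match` and the list-of-blocks inputs are hypothetical, and the overlap, same-extent, toxic, zero, and alignment checks are omitted. The counter names are borrowed from the lists above only to show which event each branch would roughly correspond to.

```python
from collections import Counter

def extend_match(src, dst, i, j):
    """Grow a one-block match (src[i] == dst[j]) into the longest run of
    pairwise-identical blocks around it.

    src and dst are lists of block contents (hypothetical stand-ins for
    blocks read from two files).  Returns ((src_start, dst_start, length),
    events), where events is a Counter keyed by event name.
    """
    assert src[i] == dst[j], "caller must supply an initial matching pair"
    events = Counter()
    first_s, first_d, length = i, j, 1

    # Extend backward one block at a time until something stops us.
    events["pairbackward_try"] += 1
    while True:
        if first_s == 0:
            events["pairbackward_bof_first"] += 1
            break
        if first_d == 0:
            events["pairbackward_bof_second"] += 1
            break
        if src[first_s - 1] != dst[first_d - 1]:
            events["pairbackward_miss"] += 1
            break
        first_s -= 1
        first_d -= 1
        length += 1
        events["pairbackward_hit"] += 1
    events["pairbackward_stop"] += 1

    # Extend forward one block at a time until something stops us.
    events["pairforward_try"] += 1
    while True:
        if first_s + length == len(src):
            events["pairforward_eof_first"] += 1
            break
        if first_d + length == len(dst):
            events["pairforward_eof_second"] += 1
            break
        if src[first_s + length] != dst[first_d + length]:
            events["pairforward_miss"] += 1
            break
        length += 1
        events["pairforward_hit"] += 1
    events["pairforward_stop"] += 1

    return (first_s, first_d, length), events
```

For example, with `src = list("xABCDy")`, `dst = list("ABCDz")` and the initial match at `src[2]`/`dst[1]`, the match grows one block backward (stopping at the beginning of `dst`) and two blocks forward (stopping at a data mismatch).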
progress
--------
The `progress` event group consists of events related to progress estimation.
* `progress_no_data_bg`: Failed to retrieve any data block groups from the filesystem.
* `progress_not_created`: A crawler for one size tier had not been created for the extent scanner.
* `progress_complete`: A crawler for one size tier has completed a scan.
* `progress_not_found`: The extent position for a crawler does not correspond to any block group.
* `progress_out_of_bg`: The extent position for a crawler does not correspond to any data block group.
* `progress_ok`: Table of progress and ETA created successfully.
readahead
---------
The `readahead` event group consists of events related to data prefetching (formerly calls to `posix_fadvise` or `readahead`, but now emulated in userspace).
* `readahead_bytes`: Number of bytes prefetched.
* `readahead_count`: Number of read calls.
* `readahead_clear`: Number of times the duplicate read cache was cleared.
* `readahead_fail`: Number of read errors during prefetch.
* `readahead_ms`: Total time spent emulating readahead in user-space (kernel readahead is not measured).
* `readahead_skip`: Number of times a duplicate read was identified in the cache and skipped.
* `readahead_unread_ms`: Total time spent running `posix_fadvise(..., POSIX_FADV_DONTNEED)`.
replacedst
----------
The `replacedst` event group consists of events related to replacing a single reference to a dst extent using any suitable src extent (i.e. eliminating a single duplicate extent ref during a crawl).
* `replacedst_dedup_hit`: A duplicate extent reference was identified and removed.
* `replacedst_dedup_miss`: A duplicate extent reference was identified, but src and dst extents did not match (i.e. the filesystem changed in the meantime).
* `replacedst_grown`: A duplicate block was identified, and adjacent blocks were duplicate as well.
* `replacedst_overlaps`: A pair of duplicate block ranges was identified, but the pair was not usable for dedupe because the two ranges overlap.
* `replacedst_same`: A pair of duplicate block ranges was identified, but the pair was not usable for dedupe because the physical block ranges were the same.
* `replacedst_try`: A duplicate block was identified and an attempt was made to remove it (i.e. this is the total number of replacedst calls).
replacesrc
----------
The `replacesrc` event group consists of events related to replacing every reference to a src extent using a temporary copy of the extent's data (i.e. eliminating leftover unique data in a partially duplicate extent during a crawl).
* `replacesrc_dedup_hit`: A duplicate extent reference was identified and removed.
* `replacesrc_dedup_miss`: A duplicate extent reference was identified, but src and dst extents did not match (i.e. the filesystem changed in the meantime).
* `replacesrc_grown`: A duplicate block was identified, and adjacent blocks were duplicate as well.
* `replacesrc_overlaps`: A pair of duplicate block ranges was identified, but the pair was not usable for dedupe because the two ranges overlap.
* `replacesrc_try`: A duplicate block was identified and an attempt was made to remove it (i.e. this is the total number of replacesrc calls).
resolve
-------
The `resolve` event group consists of operations related to translating a btrfs virtual block address (i.e. physical block address) to a `(root, inode, offset)` tuple (i.e. locating and opening the file containing a matching block). `resolve` is the top level; `chase` and `adjust` are the lower two levels.
* `resolve_empty`: The `LOGICAL_INO` ioctl returned successfully with an empty reference list (0 items).
* `resolve_fail`: The `LOGICAL_INO` ioctl returned an error.
* `resolve_large`: The `LOGICAL_INO` ioctl returned more than 2730 results (the limit of the v1 ioctl).
* `resolve_ms`: Total time spent in the `LOGICAL_INO` ioctl (i.e. wallclock time, not kernel CPU time).
* `resolve_ok`: The `LOGICAL_INO` ioctl returned success.
* `resolve_overflow`: The `LOGICAL_INO` ioctl returned 9999 or more extents (the limit configured in `bees.h`).
* `resolve_toxic`: The `LOGICAL_INO` ioctl took more than 0.1 seconds of kernel CPU time.
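The figure of 2730 results in `resolve_large` can be reproduced with a little arithmetic. The sizes below are assumptions about the kernel interface that are not stated in this document: a 64KiB cap on the v1 result buffer, a 16-byte header of four 32-bit counters, and three 64-bit values (inode, offset, root) per reference.

```python
BUFFER_BYTES = 64 * 1024   # assumed cap on the v1 LOGICAL_INO result buffer
HEADER_BYTES = 4 * 4       # assumed header: four 32-bit counters
REF_BYTES = 3 * 8          # assumed record: inode, offset, root (64 bits each)

def max_refs(buffer_bytes=BUFFER_BYTES):
    """Number of whole reference records that fit after the header."""
    return (buffer_bytes - HEADER_BYTES) // REF_BYTES

print(max_refs())  # 2730
```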
root
----
The `root` event group consists of operations related to translating a btrfs root ID (i.e. subvol ID) into an open file descriptor by navigating the btrfs root tree.
* `root_clear`: The root FD cache was cleared.
* `root_found`: A root FD was successfully opened.
* `root_notfound`: A root FD could not be opened because none of the candidate paths could be opened, or there were no paths available.
* `root_ok`: A root FD was opened and its correctness verified.
* `root_open_fail`: A root FD `open()` attempt returned an error.
* `root_parent_open_fail`: A recursive call to open the parent of a subvol failed.
* `root_parent_open_ok`: A recursive call to open the parent of a subvol succeeded.
* `root_parent_open_try`: A recursive call to open the parent of a subvol was attempted.
* `root_parent_path_empty`: No path could be found to connect a parent root FD to its child.
* `root_parent_path_fail`: The `INO_PATH` ioctl failed to find a name for a child subvol relative to its parent.
* `root_parent_path_open_fail`: The `open()` call in a recursive call to open the parent of a subvol returned an error.
* `root_workaround_btrfs_send`: A subvol was determined to be read-only and disabled to implement the btrfs send workaround.
scan
----
The `scan` event group consists of operations related to scanning incoming data. This is where bees finds duplicate data and populates the hash table.
* `scan_blacklisted`: A blacklisted extent was passed to `scan_forward` and dropped.
* `scan_block`: A block of data was scanned.
* `scan_compressed_no_dedup`: An extent that was compressed contained non-zero, non-duplicate data.
* `scan_dup_block`: Number of duplicate block references deduped.
* `scan_dup_hit`: A pair of duplicate block ranges was found.
* `scan_dup_miss`: A pair of duplicate blocks was found in the hash table but not in the filesystem.
* `scan_extent`: An extent was scanned (`scan_one_extent`).
* `scan_forward`: A logical byte range was scanned (`scan_forward`).
* `scan_found`: An entry was found in the hash table matching a scanned block from the filesystem.
* `scan_hash_hit`: A block was found on the filesystem corresponding to a block found in the hash table.
* `scan_hash_miss`: A block was not found on the filesystem corresponding to a block found in the hash table.
* `scan_hash_preinsert`: A non-zero data block's hash was prepared for possible insertion into the hash table.
* `scan_hash_insert`: A non-zero data block's hash was inserted into the hash table.
* `scan_hole`: A hole extent was found during scan and ignored.
* `scan_interesting`: An extent had flags that were not recognized by bees and was ignored.
* `scan_lookup`: A hash was looked up in the hash table.
* `scan_malign`: A block being scanned matched a hash at EOF in the hash table, but the EOF was not aligned to a block boundary and the two blocks did not have the same length.
* `scan_push_front`: An entry in the hash table matched a duplicate block, so the entry was moved to the head of its LRU list.
* `scan_reinsert`: A copied block's hash and block address was inserted into the hash table.
* `scan_resolve_hit`: A block address in the hash table was successfully resolved to an open FD and offset pair.
* `scan_resolve_zero`: A block address in the hash table was not resolved to any subvol/inode pair, so the corresponding hash table entry was removed.
* `scan_rewrite`: A range of bytes in a file was copied, then the copy deduped over the original data.
* `scan_root_dead`: A deleted subvol was detected.
* `scan_seen_clear`: The list of recently scanned extents reached maximum size and was cleared.
* `scan_seen_erase`: An extent reference was modified by scan, so all future references to the extent must be scanned.
* `scan_seen_hit`: A scan was skipped because the same extent had recently been scanned.
* `scan_seen_insert`: An extent reference was not modified by scan and its hashes have been inserted into the hash table, so all future references to the extent can be ignored.
* `scan_seen_miss`: A scan was not skipped because the same extent had not recently been scanned (i.e. the extent was scanned normally).
* `scan_skip_bytes`: Nuisance dedupe or hole-punching would save less than half of the data in an extent.
* `scan_skip_ops`: Nuisance dedupe or hole-punching would require too many dedupe/copy/hole-punch operations in an extent.
* `scan_toxic_hash`: A scanned block has the same hash as a hash table entry that is marked toxic.
* `scan_toxic_match`: A hash table entry points to a block that is discovered to be toxic.
* `scan_twice`: Two references to the same block have been found in the hash table.
* `scan_zero`: A data block containing only zero bytes was detected.
scanf
-----
The `scanf` event group consists of operations related to `BeesContext::scan_forward`. This is the entry point where `crawl` schedules new data for scanning.
* `scanf_deferred_extent`: Two tasks attempted to scan the same extent at the same time, so one was deferred.
* `scanf_eof`: Scan past EOF was attempted.
* `scanf_extent`: A btrfs extent item was scanned.
* `scanf_extent_ms`: Total thread-seconds spent scanning btrfs extent items.
* `scanf_no_fd`: References to a block from the hash table were found, but a FD could not be opened.
* `scanf_total`: A logical byte range of a file was scanned.
* `scanf_total_ms`: Total thread-seconds spent scanning logical byte ranges.
Note that in current versions of bees, `scan_forward` is passed extents
that correspond exactly to btrfs extent items, so the `scanf_extent` and
`scanf_total` numbers can only be different if the filesystem changes
between crawl time and scan time.
sync
----
The `sync` event group consists of operations related to the `fsync` workarounds in bees.
* `sync_count`: `fsync()` was called on a temporary file.
* `sync_ms`: Total time spent executing `fsync()`.
tmp
---
The `tmp` event group consists of operations related to temporary files and the data within them.
* `tmp_aligned`: A temporary extent was allocated on a block boundary.
* `tmp_block`: Total number of temporary blocks copied.
* `tmp_block_zero`: Total number of temporary hole blocks copied.
* `tmp_bytes`: Total number of temporary bytes copied.
* `tmp_copy`: Total number of extents copied.
* `tmp_copy_ms`: Total time spent copying extents.
* `tmp_create`: Total number of temporary files created.
* `tmp_create_ms`: Total time spent creating temporary files.
* `tmp_hole`: Total number of hole extents created.
* `tmp_realign`: A temporary extent was not aligned to a block boundary.
* `tmp_resize`: A temporary file was resized with `ftruncate()`.
* `tmp_resize_ms`: Total time spent in `ftruncate()`.
* `tmp_trunc`: The temporary file size limit was exceeded, triggering a new temporary file creation.

230
docs/gotchas.md Normal file

@@ -0,0 +1,230 @@
bees Gotchas
============
C++ Exceptions
--------------
bees is very paranoid about the data it gets from btrfs, and if btrfs
does anything bees does not expect, bees will throw an exception and move
on without touching the offending data. This will trigger a stack trace
to the log containing data which is useful for developers to understand
what happened.
In all cases C++ exceptions in bees are harmless to data on the
filesystem. bees handles most exceptions by aborting processing of
the current extent and moving to the next extent. In some cases an
exception may occur in a critical bees thread, which will stop the bees
process from making any further progress; however, these cases are rare
and are typically caused by unusual filesystem conditions (e.g. [freshly
formatted filesystem with no
data](https://github.com/Zygo/bees/issues/93)) or lack of memory or
other resources.
The following are common cases that users may encounter:
* If a snapshot is deleted, bees will generate a burst of exceptions for
references to files in the snapshot that no longer exist. This lasts
until the FD caches are cleared, usually a few minutes with default
btrfs mount options. These generally look like:
`std::system_error: BTRFS_IOC_TREE_SEARCH_V2: [path] at fs.cc:844: No such file or directory`
* If data is modified at the same time it is being scanned, bees will get
an inconsistent version of the data layout in the filesystem, causing
the `ExtentWalker` class to throw various constraint-check exceptions.
The exception causes bees to retry the extent in a later filesystem scan
(hopefully when the file is no longer being modified). The exception
text is similar to:
`std::runtime_error: fm.rbegin()->flags() = 776 failed constraint check (fm.rbegin()->flags() & FIEMAP_EXTENT_LAST) at extentwalker.cc:229`
but the line number or specific code fragment may vary.
* If there are too many possible matching blocks within a pair of extents,
bees will loop billions of times considering all possibilities. This is
a waste of time, so an exception is currently used to break out of such
loops early. The exception text in this case is:
`FIXME: too many duplicate candidates, bailing out here`
Terminating bees with SIGTERM
-----------------------------
bees is designed to survive host crashes, so it is safe to terminate bees
using SIGKILL; however, when bees next starts up, it will repeat some
work that was performed between the last bees crawl state save point
and the SIGKILL (up to 15 minutes), and a large hash table may not be
completely written back to disk, so some duplicate matches will be lost.
If bees is stopped and started less than once per week, then this is not
a problem as the proportional impact is quite small; however, users who
stop and start bees daily or even more often may prefer to have a clean
shutdown with SIGTERM so bees can restart faster.
The shutdown procedure performs these steps:
1. Crawl state is saved to `$BEESHOME`. This is the most
important bees state to save to disk as it directly impacts
restart time, so it is done as early as possible.
2. Hash table is written to disk. Normally the hash table is
trickled back to disk at a rate of about 128KiB per second;
however, SIGTERM causes bees to attempt to flush the whole table
immediately. The time spent here depends on the size of RAM, speed
of disks, and aggressiveness of competing filesystem workloads.
It can trigger `vm.dirty_bytes` limits and block other processes
writing to the filesystem for a while.
3. The bees process calls `_exit`, which terminates all running
worker threads, closes and deletes all temporary files. This
can take a while _after_ the bees process exits, especially on
slow spinning disks.
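A supervising script that wants the clean shutdown described above could send SIGTERM and then wait patiently, rather than escalating to SIGKILL after a short delay. The following is a minimal sketch under that assumption; `stop_gracefully` and its timeout parameter are hypothetical names and are not part of bees.

```python
import signal
import subprocess

def stop_gracefully(proc, timeout_seconds):
    """Send SIGTERM to a child process and wait for it to exit.

    proc is any subprocess.Popen handle.  Returns True if the process
    exited within timeout_seconds, False if it is still running (the
    caller then decides whether to keep waiting).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout_seconds)
        return True
    except subprocess.TimeoutExpired:
        return False
```

Because flushing the hash table can take a while on a busy filesystem, a generous timeout is probably more appropriate here than the few seconds many service wrappers use by default.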
Balances
--------
A btrfs balance relocates data on disk by making a new copy of the
data, replacing all references to the old data with references to the
new copy, and deleting the old copy. To bees, this is the same as any
other combination of new and deleted data (e.g. from defrag, or ordinary
file operations): some new data has appeared (to be scanned) and some
old data has disappeared (to be removed from the hash table when it is
detected).
As bees scans the newly balanced data, it will get hits on the hash
table pointing to the old data (it's identical data, so it would look
like a duplicate). These old hash table entries will not be valid any
more, so when bees tries to compare new data with old data, it will not
be able to find the old data at the old address, and bees will delete
the hash table entries. If no other duplicates are found, bees will
then insert new hash table entries pointing to the new data locations.
The erase is performed before the insert, so the new data simply replaces
the old and there is (little or) no impact on hash table entry lifetimes
(depending on how overcommitted the hash table is). Each block is
processed one at a time, which can be slow if there are many of them.
Routine btrfs maintenance balances rarely need to relocate more than 0.1%
of the total filesystem data, so the impact on bees is small even after
taking into account the extra work bees has to do.
If the filesystem must undergo a full balance (e.g. because disks were
added or removed, or to change RAID profiles), then every data block on
the filesystem will be relocated to a new address, which invalidates all
the data in the bees hash table at once. In such cases it is a good idea to:
1. Stop bees before the full balance starts,
2. Wipe the `$BEESHOME` directory (or delete and recreate `beeshash.dat`),
3. Restart bees after the full balance is finished.
bees will perform a full filesystem scan automatically after the balance
since all the data has "new" btrfs transids. bees won't waste any time
invalidating stale hash table data after the balance if the hash table
is empty. This can considerably improve the performance of both bees
(since it has no stale hash table entries to invalidate) and btrfs balance
(since it's not competing with bees for iops).
Snapshots
---------
bees can dedupe filesystems with many snapshots, but bees only does
well in this situation if bees was running on the filesystem from
the beginning.
Each time bees dedupes an extent that is referenced by a snapshot,
the entire metadata page in the snapshot subvol (16KB by default) must
be CoWed in btrfs. Since all references must be removed at the same
time, this CoW operation is repeated in every snapshot containing the
duplicate data. This can result in a substantial increase in btrfs
metadata size if there are many snapshots on a filesystem.
Normally, metadata is small (less than 1% of the filesystem) and dedupe
hit rates are large (10-40% of the filesystem), so the increase in
metadata size is offset by much larger reductions in data size and the
total space used by the entire filesystem is reduced.
If a subvol is deduped _before_ a snapshot is created, the snapshot will
have the same deduplication as the subvol. This does _not_ result in
unusually large metadata sizes. If a snapshot is made after bees has
fully scanned the origin subvol, bees can avoid scanning most of the
data in the snapshot subvol, as it will be provably identical to the
origin subvol that was already scanned.
If a subvol is deduped _after_ a snapshot is created, the origin and
snapshot subvols must be deduplicated separately. In the worst case, this
will double the amount of reading the bees scanner must perform, and will
also double the amount of btrfs metadata used for the snapshot; however,
the "worst case" is a dedupe hit rate of 1% or more, so a doubling of
metadata size is certain for all but the most unique data sets. Also,
bees will not be able to free any space until the last snapshot has been
scanned and deduped, so payoff in data space savings is deferred until
the metadata has almost finished expanding.
If a subvol is deduped after _many_ snapshots have been created, all
subvols must be deduplicated individually. In the worst case, this will
multiply the scanning work and metadata size by the number of snapshots.
For 100 snapshots this can mean a 100x growth in metadata size and
bees scanning time, which typically exceeds the possible savings from
reducing the data size by dedupe. In such cases using bees will result
in a net increase in disk space usage that persists until the snapshots
are deleted.
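The trade-off in the paragraphs above can be made concrete with a rough worst-case estimate. This is only a back-of-the-envelope model with assumed inputs: it supposes dedupe unshares the metadata of every snapshot completely, which adds roughly one extra copy of the metadata per snapshot. The function name and parameters are hypothetical.

```python
def net_space_change(data_gib, metadata_gib, hit_percent, snapshots):
    """Worst-case net space saved (positive) or lost (negative), in GiB.

    data_gib      -- size of the data before dedupe
    metadata_gib  -- size of the metadata before dedupe
    hit_percent   -- percentage of data removed by dedupe
    snapshots     -- number of snapshots created before bees first ran
    """
    data_saved = data_gib * hit_percent // 100
    metadata_growth = metadata_gib * snapshots  # one unshared copy each
    return data_saved - metadata_growth

# 1000 GiB of data, 10 GiB of metadata (1%), 30% dedupe hit rate:
print(net_space_change(1000, 10, 30, 0))    # 300
print(net_space_change(1000, 10, 30, 1))    # 290
print(net_space_change(1000, 10, 30, 100))  # -700
```

With no snapshots or a single snapshot the data savings dominate, but with 100 pre-existing snapshots the same model shows a net loss of space until the snapshots are deleted.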
Snapshot case studies
---------------------
* bees running on an empty filesystem
* filesystem is mkfsed
* bees is installed and starts running
* data is written to the filesystem
* bees dedupes the data as it appears
* a snapshot is made of the data
* The snapshot will already be 99% deduped, so the metadata will
not expand very much because only 1% of the data in the snapshot
must be deduped.
* more snapshots are made of the data
* as long as dedupe has been completed on the origin subvol,
bees will quickly scan each new snapshot because it can skip
all the previously scanned data. Metadata usage remains low
(it may even shrink because there are fewer csums).
* bees installed on a non-empty filesystem with snapshots
* filesystem is mkfsed
* data is written to the filesystem
* multiple snapshots are made of the data
* bees is installed and starts running
* bees dedupes each snapshot individually
* The snapshot metadata will no longer be shared, resulting in
substantial growth of metadata usage.
* Disk space savings do not occur until bees processes the
last snapshot reference to data.
Other Gotchas
-------------
* bees avoids the [slow backrefs kernel bug](btrfs-kernel.md) by
measuring the time required to perform `LOGICAL_INO` operations.
If an extent requires over 5.0 kernel CPU seconds to perform a
`LOGICAL_INO` ioctl, then bees blacklists the extent and avoids
referencing it in future operations. In most cases, fewer than 0.1%
of extents in a filesystem must be avoided this way. This results
in short write latency spikes as btrfs will not allow writes to the
filesystem while `LOGICAL_INO` is running. Generally the CPU spends
most of the runtime of the `LOGICAL_INO` ioctl running the kernel,
so on a single-core CPU the entire system can freeze up for a second
during operations on toxic extents. Note this only occurs on older
kernels. See [the slow backrefs kernel bug section](btrfs-kernel.md).
* If a process holds a directory FD open, the subvol containing the
directory cannot be deleted (`btrfs sub del` will start the deletion
process, but it will not proceed past the first open directory FD).
`btrfs-cleaner` will simply skip over the directory *and all of its
children* until the FD is closed. bees avoids this gotcha by closing
all of the FDs in its directory FD cache every btrfs transaction.
* If a file is deleted while bees is caching an open FD to the file,
bees continues to scan the file. For very large files (e.g. VM
images), the deletion of the file can be delayed indefinitely.
To limit this delay, bees closes all FDs in its file FD cache every
btrfs transaction.

102
docs/how-it-works.md Normal file

@@ -0,0 +1,102 @@
How bees Works
--------------
bees is a daemon designed to run continuously and maintain its state
across crashes and reboots.
bees uses checkpoints for persistence to eliminate the IO overhead of a
transactional data store. On restart, bees will dedupe any data that
was added to the filesystem since the last checkpoint. Checkpoints
occur every 15 minutes for scan progress, stored in `beescrawl.dat`.
The hash table trickle-writes to disk at 128KiB/s to `beeshash.dat`,
but will flush immediately if bees is terminated by SIGTERM.
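To get a feel for what the trickle-write rate means for large hash tables, the arithmetic can be worked through as below. It assumes the rate is a constant 128KiB/s and that one pass writes the whole table once; both are simplifications made for illustration.

```python
TRICKLE_BYTES_PER_SEC = 128 * 1024  # rate quoted in the text above

def full_writeback_seconds(hash_table_bytes, rate=TRICKLE_BYTES_PER_SEC):
    """Seconds needed to write the whole table once at the trickle rate."""
    return hash_table_bytes / rate

for gib in (1, 8):
    secs = full_writeback_seconds(gib * 1024**3)
    print(f"{gib} GiB: {secs / 3600:.1f} hours")
# 1 GiB: 2.3 hours
# 8 GiB: 18.2 hours
```

Under these assumptions a multi-gigabyte table takes hours to reach disk by trickling alone, which is why an immediate flush on SIGTERM matters for users who restart bees often.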
There are no special requirements for bees hash table storage--`.beeshome`
could be stored on a different btrfs filesystem, ext4, or even CIFS (but
not MS-DOS--beeshome does need filenames longer than 8.3).
bees uses a persistent dedupe hash table with a fixed size configured
by the user. Any size of hash table can be dedicated to dedupe. If a
fast dedupe with low hit rate is desired, bees can use a hash table as
small as 128KB.
The bees hash table is loaded into RAM at startup and `mlock`ed so it
will not be swapped out by the kernel (if swap is permitted, performance
degrades to nearly zero, for both bees and the swap device).
bees scans the filesystem in a single pass which removes duplicate
extents immediately after they are detected. There are no distinct
scanning and dedupe phases, so bees can start recovering free space
immediately after startup.
Once a filesystem scan has been completed, bees uses the `min_transid`
parameter of the `TREE_SEARCH_V2` ioctl to avoid rescanning old data
on future scans and quickly scan new data. An incremental data scan
can complete in less than a millisecond on an idle filesystem.
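The effect of a `min_transid` filter can be illustrated with a small userspace model. In the real ioctl the kernel performs the filtering while searching the tree; the sketch below only mimics the outcome on an in-memory list, and the `ExtentRef` type and `new_extents` function are hypothetical.

```python
from typing import List, NamedTuple

class ExtentRef(NamedTuple):
    bytenr: int       # hypothetical: logical address of the extent
    generation: int   # transid of the transaction that wrote the extent

def new_extents(extents: List[ExtentRef], min_transid: int) -> List[ExtentRef]:
    """Keep only extents written in transaction min_transid or later."""
    return [e for e in extents if e.generation >= min_transid]

extents = [ExtentRef(4096, 10), ExtentRef(8192, 25), ExtentRef(12288, 30)]
print(new_extents(extents, 25))  # the two extents written at transid 25 and 30
print(new_extents(extents, 31))  # [] -- nothing new, so the scan ends at once
```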
Once a duplicate data block is identified, bees examines the nearby
blocks in the files where the matched block appears. This allows bees
to find long runs of adjacent duplicate block pairs if it has an entry
for any one of the blocks in its hash table. On typical data sets,
this means most of the blocks in the hash table are redundant and can
be discarded without significant impact on dedupe hit rate.
Hash table entries are grouped together into LRU lists. As each block
is scanned, its hash table entry is inserted into the LRU list at a
random position. If the LRU list is full, the entry at the end of the
list is deleted. If a hash table entry is used to discover duplicate
blocks, the entry is moved to the beginning of the list. This makes bees
unable to detect a small number of duplicates, but it dramatically
improves efficiency on filesystems with many small files.
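The list behaviour described above can be modelled in a few lines of Python. This is a toy model rather than the bees data structure: the `Bucket` class, its capacity, and the choice to evict before inserting are assumptions made for illustration.

```python
import random

class Bucket:
    """Toy model of one LRU list of hash table entries."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.entries = []                 # index 0 is the head of the list
        self.rng = rng or random.Random()

    def insert(self, entry):
        """Insert a newly scanned entry at a random position."""
        if entry in self.entries:
            return
        if len(self.entries) >= self.capacity:
            self.entries.pop()            # list is full: drop the last entry
        pos = self.rng.randint(0, len(self.entries))  # both ends inclusive
        self.entries.insert(pos, entry)

    def hit(self, entry):
        """An entry found a duplicate: move it to the head of the list."""
        if entry in self.entries:
            self.entries.remove(entry)
            self.entries.insert(0, entry)
```

Entries that keep finding duplicates stay near the head and survive, while entries that never match drift toward the end and are eventually evicted.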
Once the hash table fills up, old entries are evicted by new entries.
This means that the optimum hash table size is determined by the
distance between duplicate blocks on the filesystem rather than the
filesystem unique data size. Even if the hash table is too small
to find all duplicates, it may still find _most_ of them, especially
during incremental scans where the data in many workloads tends to be
more similar.
When a duplicate block pair is found in two btrfs extents, bees will
attempt to match all other blocks in the newer extent with blocks in
the older extent (i.e. the goal is to keep the extent referenced in the
hash table and remove the most recently scanned extent). If this is
possible, then the new extent will be replaced with a reference to the
old extent. If this is not possible, then bees will create a temporary
copy of the unmatched data in the new extent so that the entire new
extent can be removed by deduplication. This must be done because btrfs
cannot partially overwrite extents--the _entire_ extent must be replaced.
The temporary copy is then scanned during the next pass bees makes over
the filesystem for potential duplication of other extents.
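The decision described above can be sketched as a planning step over the blocks of the new extent. The function name, the operation labels, and the boolean-per-block input are all hypothetical; the sketch only shows how runs of matched and unmatched blocks would be split into "dedupe against the old extent" and "copy to a temporary file first" operations.

```python
from itertools import groupby

def plan_extent_replacement(matched):
    """Split a new extent into runs of matched and unmatched blocks.

    matched[i] is True if block i of the new extent has an identical
    block in the old extent.  Returns a list of (operation, first_block,
    block_count) tuples that together cover the entire new extent.
    """
    plan = []
    pos = 0
    for is_dup, run in groupby(matched):
        count = len(list(run))
        op = "dedupe_with_old" if is_dup else "copy_to_temp_then_dedupe"
        plan.append((op, pos, count))
        pos += count
    return plan

print(plan_extent_replacement([True, True, False, True]))
# [('dedupe_with_old', 0, 2), ('copy_to_temp_then_dedupe', 2, 1),
#  ('dedupe_with_old', 3, 1)]
```

Every block appears in exactly one operation, so once all operations succeed no reference to the new extent remains and btrfs can free it.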
When a block containing all-zero bytes is found, bees dedupes the extent
against a temporary file containing a hole, possibly creating temporary
copies of any non-zero data in the extent for later deduplication as
described above. If the extent is compressed, bees avoids splitting
the extent in the middle as this generally has a negative impact on
compression ratio (and also triggers a [kernel bug](btrfs-kernel.md)).
bees does not store any information about filesystem structure, so
its performance is linear in the number or size of files. The hash
table stores physical block numbers which are converted into paths
and FDs on demand through btrfs `TREE_SEARCH_V2` and `LOGICAL_INO` ioctls.
This eliminates the storage required to maintain the equivalents
of these functions in userspace, at the expense of encountering [some
kernel bugs in `LOGICAL_INO` performance](btrfs-kernel.md).
bees uses only the data-safe `FILE_EXTENT_SAME` (aka `FIDEDUPERANGE`)
kernel ioctl to manipulate user data, so it can dedupe live data
(e.g. build servers, sqlite databases, VM disk images). bees does not
modify file attributes or timestamps in deduplicated files.
When bees has scanned all of the data, bees will pause until a new
transaction has completed in the btrfs filesystem. bees tracks
the current btrfs transaction ID over time so that it polls less often
on quiescent filesystems and more often on busy filesystems.
Scanning and deduplication work is performed by worker threads. If the
[`--loadavg-target` option](options.md) is used, bees adjusts the number
of worker threads up or down as required to have a user-specified load
impact on the system. The maximum and minimum number of threads is
configurable. If the system load is too high then bees will stop until
the load falls to acceptable levels.
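A load-based thread controller of the kind described above might look like the sketch below. It is not the algorithm bees uses: the one-thread-per-step policy and the name `adjust_thread_count` are assumptions chosen to keep the example short.

```python
def adjust_thread_count(current, load, target, min_threads, max_threads):
    """Return the new worker thread count for one control step.

    current      -- worker threads running now
    load         -- measured system load average
    target       -- user-specified load average target
    min_threads, max_threads -- configured limits (min may be 0, which
                                corresponds to stopping work entirely)
    """
    if load > target:
        wanted = current - 1
    elif load < target:
        wanted = current + 1
    else:
        wanted = current
    return max(min_threads, min(max_threads, wanted))

print(adjust_thread_count(4, 6.0, 5.0, 0, 8))  # 3
print(adjust_thread_count(8, 1.0, 5.0, 0, 8))  # 8
```

On a Unix system the `load` argument could be taken from `os.getloadavg()[0]`, the one-minute load average.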

74
docs/index.md Normal file

@@ -0,0 +1,74 @@
BEES
====
Best-Effort Extent-Same, a btrfs deduplication agent.
About bees
----------
bees is a block-oriented userspace deduplication agent designed to scale
up to large btrfs filesystems. It is an offline dedupe combined with
an incremental data scan capability to minimize time data spends on disk
from write to dedupe.
Strengths
---------
* Space-efficient hash table - can use as little as 1 GB hash table per 10 TB unique data (0.1GB/TB)
* Daemon mode - incrementally dedupes new data as it appears
* Largest extents first - recover more free space during fixed maintenance windows
* Works with btrfs compression - dedupe any combination of compressed and uncompressed files
* Whole-filesystem dedupe - scans data only once, even with snapshots and reflinks
* Persistent hash table for rapid restart after shutdown
* Constant hash table size - no increased RAM usage if data set becomes larger
* Works on live data - no scheduled downtime required
* Automatic self-throttling - reduces system load
* btrfs support - recovers more free space from btrfs than naive dedupers
Weaknesses
----------
* Whole-filesystem dedupe - has no include/exclude filters, does not accept file lists
* Requires root privilege (`CAP_SYS_ADMIN` plus the usual filesystem read/modify caps)
* [First run may increase metadata space usage if many snapshots exist](gotchas.md)
* Constant hash table size - no decreased RAM usage if data set becomes smaller
* btrfs only
Installation and Usage
----------------------
* [Installation](install.md)
* [Configuration](config.md)
* [Running](running.md)
* [Command Line Options](options.md)
Recommended Reading
-------------------
* [bees Gotchas](gotchas.md)
* [btrfs kernel bugs](btrfs-kernel.md) - especially DATA CORRUPTION WARNING for old kernels
* [bees vs. other btrfs features](btrfs-other.md)
* [What to do when something goes wrong](wrong.md)
More Information
----------------
* [How bees works](how-it-works.md)
* [Missing bees features](missing.md)
* [Event counter descriptions](event-counters.md)
Bug Reports and Contributions
-----------------------------
Email bug reports and patches to Zygo Blaxell <bees@furryterror.org>.
You can also use Github:
https://github.com/Zygo/bees
Copyright & License
-------------------
Copyright 2015-2025 Zygo Blaxell <bees@furryterror.org>.
GPL (version 3 or later).

91
docs/install.md Normal file

@@ -0,0 +1,91 @@
Building bees
=============
Dependencies
------------
* C++11 compiler (tested with GCC 8.1.0, 12.2.0)
Sorry. I really like closures and shared_ptr, so support
for earlier compiler versions is unlikely.
Note that the C++ standard--and GCC's implementation of it--is evolving.
There may be problems when building with newer compiler versions.
Build failure reports welcome!
* btrfs-progs
Needed at runtime by the service wrapper script.
* [Linux kernel version](btrfs-kernel.md) gets its own page.
* markdown to build the documentation
* util-linux version that provides `blkid` command for the helper
script `scripts/beesd` to work
Installation
============
bees can be installed by following one of these instructions:
Arch package
------------
bees is available for Arch Linux in the community repository. Install with:
`$ pacman -S bees`
or build a live version from git master using AUR:
`$ git clone https://aur.archlinux.org/bees-git.git && cd bees-git && makepkg -si`
Gentoo package
--------------
bees is officially available in Gentoo Portage. Just emerge a stable
version:
`$ emerge --ask bees`
or build a live version from git master:
`$ emerge --ask =bees-9999`
You can opt-out of building the support tools with
`USE="-tools" emerge ...`
If you want to start hacking on bees and contribute changes, just emerge
the live version which automatically pulls in all required development
packages.
Build from source
-----------------
Build with `make`. The build produces `bin/bees`, which must be copied
to somewhere in `$PATH` on the target system.
It will also generate `scripts/beesd@.service` for systemd users. This
service makes use of a helper script `scripts/beesd` to boot the service.
Both of the latter use the filesystem UUID to mount the root subvolume
within a temporary runtime directory.
### Ubuntu 16.04 - 17.04:
`$ apt -y install build-essential btrfs-tools markdown && make`
### Ubuntu 18.10:
`$ apt -y install build-essential btrfs-progs markdown && make`
Packaging
---------
See 'Dependencies' above. Package maintainers can pick ideas for building and
configuring the source package from the Gentoo ebuild:
<https://github.com/gentoo/gentoo/tree/master/sys-fs/bees>
You can configure some build options by creating a file `localconf` and
adjusting settings for your distribution environment there.
Please also review the Makefile for additional hints.

42
docs/missing.md Normal file

@@ -0,0 +1,42 @@
Features You Might Expect That bees Doesn't Have
------------------------------------------------
* There's no configuration file (patches welcome!). There are
some tunables hardcoded in the source (`src/bees.h`) that could eventually
become configuration options. There's also an incomplete option parser
(patches welcome!).
* The bees process doesn't fork and writes its log to stdout/stderr.
A shell wrapper is required to make it behave more like a daemon.
* There's no facility to exclude any part of a filesystem or focus on
specific files (patches welcome).
* PREALLOC extents and extents containing blocks filled with zeros will
be replaced by holes. There is no way to turn this off.
* The fundamental unit of deduplication is the extent _reference_, when
it should be the _extent_ itself. This is an architectural limitation
that results in excess reads of extent data, even in the Extent scan mode.
* Block reads are currently more allocation- and CPU-intensive than they
should be, especially for filesystems on SSD where the IO overhead is
much smaller. This is a problem for CPU-power-constrained environments
(e.g. laptops running from battery, or ARM devices with slow CPU).
* bees can currently fragment extents when required to remove duplicate
blocks, but has no defragmentation capability yet. When possible, bees
will attempt to work with existing extent boundaries and choose the
largest fragments available, but it will not aggregate blocks together
from multiple extents to create larger ones.
* When bees fragments an extent, the copied data is compressed. There
is currently no way (other than by modifying the source) to select a
compression method or not compress the data (patches welcome!).
* It is theoretically possible to resize the hash table without starting
over with a new full-filesystem scan; however, this feature has not been
implemented yet.
* btrfs maintains csums of data blocks which bees could use to improve
scan speeds, but bees doesn't use them yet.

124
docs/options.md Normal file

@@ -0,0 +1,124 @@
# bees Command Line Options
## Load management options
* `--thread-count COUNT` or `-c`
Specify maximum number of worker threads. Overrides `--thread-factor`
(`-C`), default/autodetected values, and the hardcoded thread limit.
* `--thread-factor FACTOR` or `-C`
Specify ratio of worker threads to detected CPU cores. Overridden by
`--thread-count` (`-c`).
Default is 1.0, i.e. 1 worker thread per detected CPU. Use values
below 1.0 to leave some cores idle, or above 1.0 if there are more
disks than CPUs in the filesystem.
* `--loadavg-target LOADAVG` or `-g`
Specify load average target for dynamic worker threads. Default is
to run the maximum number of worker threads all the time.
Worker threads will be started or stopped subject to the upper limit
imposed by `--thread-factor`, `--thread-min` and `--thread-count`
until the load average is within +/- 0.5 of `LOADAVG`.
* `--thread-min COUNT` or `-G`
Specify minimum number of dynamic worker threads. This can be used
to force a minimum number of threads to continue running while using
`--loadavg-target` to manage load.
Default is 0, i.e. all bees worker threads will stop when the system
load exceeds the target.
Has no effect unless `--loadavg-target` is used to specify a target load.
* `--throttle-factor FACTOR`
In order to avoid saturating btrfs deferred work queues, bees tracks
the time that operations with delayed effect (dedupe and tmpfile copy)
and operations with long run times (`LOGICAL_INO`) run. If an operation
finishes before the average run time for that operation, bees will
sleep for the remainder of the average run time, so that operations
are submitted to btrfs at a rate similar to the rate that btrfs can
complete them.
The `FACTOR` is multiplied by the average run time for each operation
to calculate the target delay time.
`FACTOR` 0 is the default, which adds no delays. bees will attempt
to saturate btrfs delayed work queues as quickly as possible, which
may impact other processes on the same filesystem, or even slow down
bees itself.
`FACTOR` 1.0 will attempt to keep btrfs delayed work queues filled at
a steady average rate.
`FACTOR` more than 1.0 will add delays longer than the average
run time (e.g. 10.0 will delay all operations that take less than 10x
the average run time). High values of `FACTOR` may be desirable when
using bees with other applications on the same filesystem.
The maximum delay per operation is 60 seconds.
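The delay rule described for `--throttle-factor` can be sketched as follows. This is only an illustration of the text above, not the bees implementation: the function name `throttle_sleep_seconds` and its parameters are hypothetical, and applying the 60 second cap to the computed sleep time is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of bees): how long to sleep after an
// operation that took `elapsed` seconds, given the average run time of
// that kind of operation and the configured throttle factor.
inline double throttle_sleep_seconds(double factor, double average_run_time, double elapsed)
{
	assert(factor >= 0 && average_run_time >= 0 && elapsed >= 0);
	// Assumption: the documented 60 second maximum applies to the sleep itself.
	const double max_delay = 60.0;
	// Target delay is FACTOR times the average run time for the operation.
	const double target = factor * average_run_time;
	// Sleep only for the part of the target not already spent running.
	const double remainder = target - elapsed;
	return std::min(std::max(remainder, 0.0), max_delay);
}
```

With `FACTOR` 0 this sketch never sleeps. With `FACTOR` 1.0 an operation that finishes in 0.5 seconds against a 2 second average would sleep for the remaining 1.5 seconds.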
## Filesystem tree traversal options
* `--scan-mode MODE` or `-m`
Specify extent scanning algorithm.
**EXPERIMENTAL** feature that may go away.
* Mode 0: lockstep
* Mode 1: independent
* Mode 2: sequential
* Mode 3: recent
* Mode 4: extent
For details of the different scanning modes and the default value of
this option, see [bees configuration](config.md).
## Workarounds
* `--workaround-btrfs-send` or `-a`
_This option is obsolete and should not be used any more._
Pretend that read-only snapshots are empty and silently discard any
request to dedupe files referenced through them. This is a workaround
for [problems with old kernels running `btrfs send` and `btrfs send
-p`](btrfs-kernel.md) which make these btrfs features unusable with bees.
This option was used to avoid breaking `btrfs send` on old kernels.
The affected kernels are now too old to be recommended for use with bees.
bees now waits for `btrfs send` to finish. There is no need for an
option to enable this.
**Note:** There is a _significant_ space tradeoff when using this option:
it is likely no space will be recovered--and possibly significant extra
space used--until the read-only snapshots are deleted.
## Logging options
* `--timestamps` or `-t`
Enable timestamps in log output.
* `--no-timestamps` or `-T`
Disable timestamps in log output.
* `--absolute-paths` or `-p`
Paths in log output will be absolute.
* `--strip-paths` or `-P`
Paths in log output will have the working directory at bees startup stripped.
* `--verbose` or `-v`
Set log verbosity (0 = no output, 8 = all output, default 8).

91
docs/running.md Normal file

@@ -0,0 +1,91 @@
Running bees
============
Setup
-----
If you don't want to use the helper script `scripts/beesd` to set up and
configure bees, here's how to set up bees manually.
Create a directory for bees state files:
export BEESHOME=/some/path
mkdir -p "$BEESHOME"
Create an empty hash table ([your choice of size](config.md), but it
must be a multiple of 128KB). This example creates a 1GB hash table:
truncate -s 1g "$BEESHOME/beeshash.dat"
chmod 700 "$BEESHOME/beeshash.dat"
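The 128KB size rule can be checked or applied with a few lines of arithmetic. The following is a rough sketch, not code from bees: the helper names `is_valid_hash_size` and `round_up_hash_size` are hypothetical, and 128KB is taken to mean 131072 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: "128KB" in the text above means 128 * 1024 bytes.
const std::uint64_t kHashTableUnit = 128 * 1024;

// True if a hash table file of this size satisfies the multiple-of-128KB rule.
inline bool is_valid_hash_size(std::uint64_t bytes)
{
	return bytes % kHashTableUnit == 0;
}

// Round a requested size up to the next multiple of 128KB.
inline std::uint64_t round_up_hash_size(std::uint64_t bytes)
{
	// Guard against wrap-around in the addition below.
	assert(bytes <= std::numeric_limits<std::uint64_t>::max() - (kHashTableUnit - 1));
	return (bytes + kHashTableUnit - 1) / kHashTableUnit * kHashTableUnit;
}
```

For example, a requested size of 1000000000 bytes is not a multiple of 128KB and would be rounded up to 1000079360 bytes, which could then be passed to `truncate -s`.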
bees can _only_ process the root subvol of a btrfs with nothing mounted
over top. If the bees argument is not the root subvol directory, bees
will just throw an exception and stop.
Use a separate mount point, and let only bees access it:
UUID=3399e413-695a-4b0b-9384-1b0ef8f6c4cd
mkdir -p /var/lib/bees/$UUID
mount /dev/disk/by-uuid/$UUID /var/lib/bees/$UUID -osubvol=/
If you don't set BEESHOME, the path "`.beeshome`" will be used relative
to the root subvol of the filesystem. For example:
btrfs sub create /var/lib/bees/$UUID/.beeshome
truncate -s 1g /var/lib/bees/$UUID/.beeshome/beeshash.dat
chmod 700 /var/lib/bees/$UUID/.beeshome/beeshash.dat
You can use any relative path in `BEESHOME`. The path will be taken
relative to the root of the deduped filesystem (in other words it can
be the name of a subvol):
export BEESHOME=@my-beeshome
btrfs sub create /var/lib/bees/$UUID/$BEESHOME
truncate -s 1g /var/lib/bees/$UUID/$BEESHOME/beeshash.dat
chmod 700 /var/lib/bees/$UUID/$BEESHOME/beeshash.dat
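The path rules in the examples above can be summarised in a small sketch. It is not taken from the bees source: the function `resolve_beeshome` is hypothetical, treating an empty `BEESHOME` like an unset one is an assumption, and using an absolute `BEESHOME` unchanged is inferred from the `/some/path` example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of bees): where the state files would live,
// given the mount point of the root subvol and the BEESHOME value
// (nullptr when the variable is unset).
inline std::string resolve_beeshome(const std::string &fs_root, const char *beeshome_env)
{
	assert(!fs_root.empty());
	// Assumption: an empty BEESHOME behaves like an unset one.
	const std::string home = (beeshome_env && *beeshome_env) ? beeshome_env : ".beeshome";
	// An absolute path is used as given.
	if (home[0] == '/') {
		return home;
	}
	// A relative path is taken relative to the root of the deduped filesystem.
	if (fs_root.back() == '/') {
		return fs_root + home;
	}
	return fs_root + "/" + home;
}
```

In a caller, the second argument would typically be the result of `getenv("BEESHOME")`.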
Configuration
-------------
There are some runtime configurable options using environment variables:
* BEESHOME: Directory containing bees state files:
* beeshash.dat | persistent hash table. Must be a multiple of 128KB, and must be created before bees starts.
* beescrawl.dat | state of SEARCH_V2 crawlers. ASCII text. bees will create this.
* beesstats.txt | statistics and performance counters. ASCII text. bees will create this.
* BEESSTATUS: File containing a snapshot of current bees state: performance
counters and current status of each thread. The file is meant to be
human readable, but understanding it probably requires reading the source.
You can watch bees run in realtime with a command like:
watch -n1 cat $BEESSTATUS
Other options (e.g. interval between filesystem crawls) can be configured
in `src/bees.h` or [on the command line](options.md).
Running
-------
Reduce CPU and IO priority to be kinder to other applications sharing
this host (or raise them for more aggressive disk space recovery). If you
use cgroups, put `bees` in its own cgroup, then reduce the `blkio.weight`
and `cpu.shares` parameters. You can also use `schedtool` and `ionice`
in the shell script that launches `bees`:
schedtool -D -n20 $$
ionice -c3 -p $$
You can also use the [load management options](options.md) to further
control the impact of bees on the rest of the system.
Let the bees fly:
for fs in /var/lib/bees/*-*-*-*-*/; do
bees "$fs" >> "$fs/.beeshome/bees.log" 2>&1 &
done
You'll probably want to arrange for the log file (`$fs/.beeshome/bees.log`
in the example above) to be rotated periodically. You may also want to
set umask to 077 to prevent disclosure of information about the contents
of the filesystem through the log file.
There are also some shell wrappers in the `scripts/` directory.

167
docs/wrong.md Normal file

@@ -0,0 +1,167 @@
What to do when something goes wrong with bees
==============================================
Hangs and excessive slowness
----------------------------
### Use load-throttling options
If bees is just more aggressive than you would like, consider using
[load throttling options](options.md). These are usually more effective
than `ionice`, `schedtool`, and the `blkio` cgroup (though you can
certainly use those too) because they limit work that bees queues up
for later execution inside btrfs.
### Check `$BEESSTATUS`
If bees or the filesystem seems to be stuck, check the contents of
`$BEESSTATUS`. bees describes what it is doing (and how long it has
been trying to do it) through this file.
Sample:
<pre>
THREADS (work queue 68 tasks):
tid 20939: crawl_5986: dedup BeesRangePair: 512K src[0x9933f000..0x993bf000] dst[0x9933f000..0x993bf000]
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
tid 20940: crawl_5986: dedup BeesRangePair: 512K src[0x992bf000..0x9933f000] dst[0x992bf000..0x9933f000]
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
tid 21177: crawl_5986: dedup BeesRangePair: 512K src[0x9923f000..0x992bf000] dst[0x9923f000..0x992bf000]
src = 147 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
dst = 15 /run/bees/ede84fbd-cb59-0c60-9ea7-376fa4984887/data.new/home/builder/linux/.git/objects/pack/pack-09f06f8759ac7fd163df320b7f7671f06ac2a747.pack
tid 21677: bees: [68493.1s] main
tid 21689: crawl_transid: [236.508s] waiting 332.575s for next 10 transid RateEstimator { count = 87179, raw = 969.066 / 32229.2, ratio = 969.066 / 32465.7, rate = 0.0298489, duration(1) = 33.5021, seconds_for(1) = 1 }
tid 21690: status: writing status to file '/run/bees.status'
tid 21691: crawl_writeback: [203.456s] idle, dirty
tid 21692: hash_writeback: [12.466s] flush rate limited after extent #17 of 64 extents
tid 21693: hash_prefetch: [2896.61s] idle 3600s
</pre>
The time in square brackets indicates how long the thread has been
executing the current task (if this time is below 5 seconds then it
is omitted). We can see here that the main thread (and therefore the
bees process as a whole) has been running for 68493.1 seconds, the
last hash table write was 12.5 seconds ago, and the last transid poll
was 236.5 seconds ago. Three worker threads are currently performing
dedupe on extents.
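If you want to pick out long-running threads automatically, the per-thread lines can be parsed with a regular expression. The sketch below is an assumption-heavy illustration: the line format is inferred only from the sample above, the status file is not a stable interface, and `StatusLine` and `parse_status_line` are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical record for one "tid N: name: [T s] message" line.
struct StatusLine {
	long tid = 0;
	std::string name;
	bool has_elapsed = false;     // false when the [time] part is omitted
	double elapsed_seconds = 0.0;
	std::string message;
};

// Parse one thread line from the status file. Returns false for lines
// that do not start with "tid", such as the "src = ..." continuation lines.
inline bool parse_status_line(const std::string &line, StatusLine &out)
{
	static const std::regex re(
		R"(^\s*tid (\d+): ([^:]+): (?:\[([0-9]+(?:\.[0-9]+)?)s\] )?(.*)$)");
	std::smatch m;
	if (!std::regex_match(line, m, re)) {
		return false;
	}
	// Whole match plus four capture groups.
	assert(m.size() == 5);
	out.tid = std::stol(m[1].str());
	out.name = m[2].str();
	out.has_elapsed = m[3].matched;
	out.elapsed_seconds = out.has_elapsed ? std::stod(m[3].str()) : 0.0;
	out.message = m[4].str();
	return true;
}
```

A monitoring script could, for instance, flag any worker whose `elapsed_seconds` keeps growing between two reads of `$BEESSTATUS`.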
Thread names of note:
* `bees`: main thread (doesn't do anything after startup, but its task execution time is that of the whole bees process)
* `crawl_master`: task that finds new extents in the filesystem and populates the work queue
* `crawl_transid`: btrfs transid (generation number) tracker and polling thread
* `status`: the thread that writes the status reports to `$BEESSTATUS`
* `crawl_writeback`: writes the scanner progress to `beescrawl.dat`
* `hash_writeback`: trickle-writes the hash table back to `beeshash.dat`
* `hash_prefetch`: prefetches the hash table at startup and updates `beesstats.txt` hourly
Most other threads have names that are derived from the current dedupe
task that they are executing:
* `ref_205ad76b1000_24K_50`: extent scan performing dedupe of btrfs extent bytenr `205ad76b1000`, which is 24 KiB long and has 50 references
* `extent_250_32M_16E`: extent scan searching for extents between 32 MiB + 1 and 16 EiB bytes long, tracking scan position in virtual subvol `250`.
* `crawl_378_18916`: subvol scan searching for extent refs in subvol `378`, inode `18916`.
### Dump kernel stacks of hung processes
Check the kernel stacks of all blocked kernel processes:
ps xar | while read -r x y; do ps "$x"; head -50 --verbose /proc/"$x"/task/*/stack; done | tee lockup-stacks.txt
Submit the above information in your bug report.
### Check dmesg for btrfs stack dumps
Sometimes these are relevant too.
bees Crashes
------------
* If you have a core dump, run these commands in gdb and include
the output in your report (you may need to post it as a compressed
attachment, as it can be quite large):
(gdb) set pagination off
(gdb) info shared
(gdb) bt
(gdb) thread apply all bt
(gdb) thread apply all bt full
The last line generates megabytes of output and will often crash gdb.
Submit whatever output gdb can produce.
**Note that this output may include filenames or data from your
filesystem.**
* If you have `systemd-coredump` installed, you can use `coredumpctl`:
(echo set pagination off;
echo info shared;
echo bt;
echo thread apply all bt;
echo thread apply all bt full) | coredumpctl gdb bees
* If the crash happens often (or you don't want to use coredumpctl),
you can automate the gdb data collection with this wrapper script:
<pre>
#!/bin/sh
set -x
# Move aside old core files for analysis
for x in core*; do
if [ -e "$x" ]; then
mv -vf "$x" "old-$x.$(date +%Y-%m-%d-%H-%M-%S)"
fi
done
# Delete old core files after a week
find old-core* -type f -mtime +7 -exec rm -vf {} + &
# Turn on the cores (FIXME: may need to change other system parameters
# that capture or redirect core files)
ulimit -c unlimited
# Run the command
"$@"
rv="$?"
# Don't clobber our core when gdb crashes
ulimit -c 0
# If there were core files, generate reports for them
for x in core*; do
if [ -e "$x" ]; then
gdb --core="$x" \
--eval-command='set pagination off' \
--eval-command='info shared' \
--eval-command='bt' \
--eval-command='thread apply all bt' \
--eval-command='thread apply all bt full' \
--eval-command='quit' \
--args "$@" 2>&1 | tee -a "$x.txt"
fi
done
# Return process exit status to caller
exit "$rv"
</pre>
To use the wrapper script, insert it just before the `bees` command,
as in:
gdb-wrapper bees /path/to/fs/
Kernel crashes, corruption, and filesystem damage
-------------------------------------------------
bees doesn't do anything that _should_ cause corruption or data loss;
however, [btrfs has kernel bugs](btrfs-kernel.md), so corruption is
not impossible.
Issues with the btrfs filesystem kernel code or other block device layers
should be reported to their respective maintainers.


@@ -1,13 +0,0 @@
#ifndef CRUCIBLE_BOOL_H
#define CRUCIBLE_BOOL_H
namespace crucible {
struct DefaultBool {
bool m_b;
DefaultBool(bool init = false) : m_b(init) {}
operator bool() const { return m_b; }
bool &operator=(const bool &that) { return m_b = that; }
};
}
#endif // CRUCIBLE_BOOL_H


@@ -0,0 +1,216 @@
#ifndef CRUCIBLE_BTRFS_TREE_H
#define CRUCIBLE_BTRFS_TREE_H
#include "crucible/fd.h"
#include "crucible/fs.h"
#include "crucible/bytevector.h"
namespace crucible {
using namespace std;
class BtrfsTreeItem {
uint64_t m_objectid = 0;
uint64_t m_offset = 0;
uint64_t m_transid = 0;
ByteVector m_data;
uint8_t m_type = 0;
public:
uint64_t objectid() const { return m_objectid; }
uint64_t offset() const { return m_offset; }
uint64_t transid() const { return m_transid; }
uint8_t type() const { return m_type; }
const ByteVector data() const { return m_data; }
BtrfsTreeItem() = default;
BtrfsTreeItem(const BtrfsIoctlSearchHeader &bish);
BtrfsTreeItem& operator=(const BtrfsIoctlSearchHeader &bish);
bool operator!() const;
/// Member access methods. Invoking a method on the
/// wrong type of item will throw an exception.
/// @{ Block group items
uint64_t block_group_flags() const;
uint64_t block_group_used() const;
/// @}
/// @{ Chunk items
uint64_t chunk_length() const;
uint64_t chunk_type() const;
/// @}
/// @{ Dev extent items (physical byte ranges)
uint64_t dev_extent_chunk_offset() const;
uint64_t dev_extent_length() const;
/// @}
/// @{ Dev items (devices)
uint64_t dev_item_total_bytes() const;
uint64_t dev_item_bytes_used() const;
/// @}
/// @{ Inode items
uint64_t inode_size() const;
/// @}
/// @{ Extent refs (EXTENT_DATA)
uint64_t file_extent_logical_bytes() const;
uint64_t file_extent_generation() const;
uint64_t file_extent_offset() const;
uint64_t file_extent_bytenr() const;
uint8_t file_extent_type() const;
btrfs_compression_type file_extent_compression() const;
/// @}
/// @{ Extent items (EXTENT_ITEM)
uint64_t extent_begin() const;
uint64_t extent_end() const;
uint64_t extent_flags() const;
uint64_t extent_generation() const;
/// @}
/// @{ Root items
uint64_t root_flags() const;
uint64_t root_refs() const;
/// @}
/// @{ Root backref items.
uint64_t root_ref_dirid() const;
string root_ref_name() const;
uint64_t root_ref_parent_rootid() const;
/// @}
};
ostream &operator<<(ostream &os, const BtrfsTreeItem &bti);
class BtrfsTreeFetcher {
protected:
Fd m_fd;
BtrfsIoctlSearchKey m_sk;
uint64_t m_tree = 0;
uint64_t m_min_transid = 0;
uint64_t m_max_transid = numeric_limits<uint64_t>::max();
uint64_t m_block_size = 0;
uint64_t m_lookbehind_size = 0;
uint64_t m_scale_size = 0;
uint8_t m_type = 0;
uint64_t scale_logical(uint64_t logical) const;
uint64_t unscale_logical(uint64_t logical) const;
const static uint64_t s_max_logical = numeric_limits<uint64_t>::max();
uint64_t scaled_max_logical() const;
virtual void fill_sk(BtrfsIoctlSearchKey &key, uint64_t object);
virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr);
virtual uint64_t hdr_logical(const BtrfsIoctlSearchHeader &hdr) = 0;
virtual bool hdr_match(const BtrfsIoctlSearchHeader &hdr) = 0;
virtual bool hdr_stop(const BtrfsIoctlSearchHeader &hdr) = 0;
Fd fd() const;
void fd(Fd fd);
public:
virtual ~BtrfsTreeFetcher() = default;
BtrfsTreeFetcher(Fd new_fd);
void type(uint8_t type);
uint8_t type();
void tree(uint64_t tree);
uint64_t tree();
void transid(uint64_t min_transid, uint64_t max_transid = numeric_limits<uint64_t>::max());
/// Block size (sectorsize) of filesystem
uint64_t block_size() const;
/// Fetch last object < logical, null if not found
BtrfsTreeItem prev(uint64_t logical);
/// Fetch first object > logical, null if not found
BtrfsTreeItem next(uint64_t logical);
/// Fetch object at exactly logical, null if not found
BtrfsTreeItem at(uint64_t);
/// Fetch first object >= logical
BtrfsTreeItem lower_bound(uint64_t logical);
/// Fetch last object <= logical
BtrfsTreeItem rlower_bound(uint64_t logical);
/// Estimated distance between objects
virtual uint64_t lookbehind_size() const;
virtual void lookbehind_size(uint64_t);
/// Scale size (normally block size but must be set to 1 for fs trees)
uint64_t scale_size() const;
void scale_size(uint64_t);
};
class BtrfsTreeObjectFetcher : public BtrfsTreeFetcher {
protected:
virtual void fill_sk(BtrfsIoctlSearchKey &key, uint64_t logical) override;
virtual uint64_t hdr_logical(const BtrfsIoctlSearchHeader &hdr) override;
virtual bool hdr_match(const BtrfsIoctlSearchHeader &hdr) override;
virtual bool hdr_stop(const BtrfsIoctlSearchHeader &hdr) override;
public:
using BtrfsTreeFetcher::BtrfsTreeFetcher;
};
class BtrfsTreeOffsetFetcher : public BtrfsTreeFetcher {
protected:
uint64_t m_objectid = 0;
virtual void fill_sk(BtrfsIoctlSearchKey &key, uint64_t offset) override;
virtual uint64_t hdr_logical(const BtrfsIoctlSearchHeader &hdr) override;
virtual bool hdr_match(const BtrfsIoctlSearchHeader &hdr) override;
virtual bool hdr_stop(const BtrfsIoctlSearchHeader &hdr) override;
public:
using BtrfsTreeFetcher::BtrfsTreeFetcher;
void objectid(uint64_t objectid);
uint64_t objectid() const;
};
class BtrfsCsumTreeFetcher : public BtrfsTreeOffsetFetcher {
public:
const uint32_t BTRFS_CSUM_TYPE_UNKNOWN = uint32_t(1) << 16;
private:
size_t m_sum_size = 0;
uint32_t m_sum_type = BTRFS_CSUM_TYPE_UNKNOWN;
public:
BtrfsCsumTreeFetcher(const Fd &fd);
uint32_t sum_type() const;
size_t sum_size() const;
void get_sums(uint64_t logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t count)> output);
};
/// Fetch extent items from extent tree.
/// Does not filter out metadata! See BtrfsDataExtentTreeFetcher for that.
class BtrfsExtentItemFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsExtentItemFetcher(const Fd &fd);
};
/// Fetch extent refs from an inode. Caller must set the tree and objectid.
class BtrfsExtentDataFetcher : public BtrfsTreeOffsetFetcher {
public:
BtrfsExtentDataFetcher(const Fd &fd);
};
/// Fetch raw inode items
class BtrfsInodeFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsInodeFetcher(const Fd &fd);
BtrfsTreeItem stat(uint64_t subvol, uint64_t inode);
};
/// Fetch a root (subvol) item
class BtrfsRootFetcher : public BtrfsTreeObjectFetcher {
public:
BtrfsRootFetcher(const Fd &fd);
BtrfsTreeItem root(uint64_t subvol);
BtrfsTreeItem root_backref(uint64_t subvol);
};
/// Fetch data extent items from extent tree, skipping metadata-only block groups
class BtrfsDataExtentTreeFetcher : public BtrfsExtentItemFetcher {
BtrfsTreeItem m_current_bg;
BtrfsTreeOffsetFetcher m_chunk_tree;
protected:
virtual void next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr) override;
public:
BtrfsDataExtentTreeFetcher(const Fd &fd);
};
}
#endif


@@ -13,18 +13,22 @@
// __u64 typedef and friends
#include <linux/types.h>
// try Linux headers first
#include <btrfs/ioctl.h>
// the btrfs headers
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>
// Supply any missing definitions
#define mutex not_mutex
#include <btrfs/ctree.h>
// Repair the damage
#undef min
#undef max
#undef mutex
// And now all the things that have been missing in some version of
// the headers.
#ifndef BTRFS_FIRST_FREE_OBJECTID
enum btrfs_compression_type {
BTRFS_COMPRESS_NONE,
BTRFS_COMPRESS_ZLIB,
BTRFS_COMPRESS_LZO,
BTRFS_COMPRESS_ZSTD,
};
// BTRFS_CSUM_ITEM_KEY is not defined in include/uapi
#ifndef BTRFS_CSUM_ITEM_KEY
#define BTRFS_ROOT_TREE_OBJECTID 1ULL
#define BTRFS_EXTENT_TREE_OBJECTID 2ULL
@@ -74,9 +78,6 @@
#define BTRFS_SHARED_BLOCK_REF_KEY 182
#define BTRFS_SHARED_DATA_REF_KEY 184
#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
#define BTRFS_FREE_SPACE_INFO_KEY 198
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
#define BTRFS_DEV_EXTENT_KEY 204
#define BTRFS_DEV_ITEM_KEY 216
#define BTRFS_CHUNK_ITEM_KEY 228
@@ -93,6 +94,18 @@
#endif
#ifndef BTRFS_FREE_SPACE_INFO_KEY
#define BTRFS_FREE_SPACE_INFO_KEY 198
#define BTRFS_FREE_SPACE_EXTENT_KEY 199
#define BTRFS_FREE_SPACE_BITMAP_KEY 200
#define BTRFS_FREE_SPACE_OBJECTID -11ULL
#endif
#ifndef BTRFS_BLOCK_GROUP_RAID1C4
#define BTRFS_BLOCK_GROUP_RAID1C3 (1ULL << 9)
#define BTRFS_BLOCK_GROUP_RAID1C4 (1ULL << 10)
#endif
#ifndef BTRFS_DEFRAG_RANGE_START_IO
// For some reason uapi has BTRFS_DEFRAG_RANGE_COMPRESS and
@@ -130,7 +143,7 @@
};
#endif
#ifndef BTRFS_IOC_CLONE_RANGE
struct btrfs_ioctl_clone_range_args {
@@ -158,7 +171,7 @@
__u64 bytes_deduped; /* out - total # of bytes we were able
* to dedupe from this file */
/* status of this dedupe operation:
* 0 if dedup succeeds
* 0 if dedupe succeeds
* < 0 for error
* == BTRFS_SAME_DATA_DIFFERS if data differs
*/
@@ -202,4 +215,51 @@
struct btrfs_ioctl_search_args_v2)
#endif
#ifndef BTRFS_IOC_LOGICAL_INO_V2
#define BTRFS_IOC_LOGICAL_INO_V2 _IOWR(BTRFS_IOCTL_MAGIC, 59, struct btrfs_ioctl_logical_ino_args)
#define BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET (1ULL << 0)
#endif
#ifndef BTRFS_FS_INFO_FLAG_CSUM_INFO
/* Request information about checksum type and size */
#define BTRFS_FS_INFO_FLAG_CSUM_INFO (1 << 0)
#endif
#ifndef BTRFS_FS_INFO_FLAG_GENERATION
/* Request information about filesystem generation */
#define BTRFS_FS_INFO_FLAG_GENERATION (1 << 1)
#endif
#ifndef BTRFS_FS_INFO_FLAG_METADATA_UUID
/* Request information about filesystem metadata UUID */
#define BTRFS_FS_INFO_FLAG_METADATA_UUID (1 << 2)
#endif
// BTRFS_CSUM_TYPE_CRC32 was a #define from 2008 to 2019.
// After that, it's an enum with the other 3 types.
// So if we do _not_ have CRC32 defined, it means we have the other 3;
// if we _do_ have CRC32 defined, it means we need the other 3.
// This seems likely to break some day.
#ifdef BTRFS_CSUM_TYPE_CRC32
#define BTRFS_CSUM_TYPE_XXHASH 1
#define BTRFS_CSUM_TYPE_SHA256 2
#define BTRFS_CSUM_TYPE_BLAKE2 3
#endif
struct btrfs_ioctl_fs_info_args_v3 {
__u64 max_id; /* out */
__u64 num_devices; /* out */
__u8 fsid[BTRFS_FSID_SIZE]; /* out */
__u32 nodesize; /* out */
__u32 sectorsize; /* out */
__u32 clone_alignment; /* out */
/* See BTRFS_FS_INFO_FLAG_* */
__u16 csum_type; /* out */
__u16 csum_size; /* out */
__u64 flags; /* in/out */
__u64 generation; /* out */
__u8 metadata_uuid[BTRFS_FSID_SIZE]; /* out */
__u8 reserved[944]; /* pad to 1k */
};
#endif // CRUCIBLE_BTRFS_H


@@ -0,0 +1,80 @@
#ifndef _CRUCIBLE_BYTEVECTOR_H_
#define _CRUCIBLE_BYTEVECTOR_H_
#include <crucible/error.h>
#include <memory>
#include <mutex>
#include <ostream>
#include <cstdint>
#include <cstdlib>
namespace crucible {
using namespace std;
// new[] is a little slower than malloc
// shared_ptr is about 2x slower than unique_ptr
// vector<uint8_t> is ~160x slower
// so we won't bother with unique_ptr because we can't do shared copies with it
class ByteVector {
public:
using Pointer = shared_ptr<uint8_t>;
using value_type = Pointer::element_type;
using iterator = value_type*;
ByteVector() = default;
ByteVector(const ByteVector &that);
ByteVector& operator=(const ByteVector &that);
ByteVector(size_t size);
ByteVector(const ByteVector &that, size_t start, size_t length);
ByteVector(iterator begin, iterator end, size_t min_size = 0);
ByteVector at(size_t start, size_t length) const;
value_type& at(size_t) const;
iterator begin() const;
void clear();
value_type* data() const;
bool empty() const;
iterator end() const;
value_type& operator[](size_t) const;
size_t size() const;
bool operator==(const ByteVector &that) const;
// this version of erase only works at the beginning or end of the buffer, else throws exception
void erase(iterator first);
void erase(iterator first, iterator last);
// An important use case is ioctls that have a fixed-size header struct
// followed by a buffer for further arguments. These templates avoid
// doing reinterpret_casts every time.
template <class T> ByteVector(const T& object, size_t min_size);
template <class T> T* get() const;
private:
Pointer m_ptr;
size_t m_size = 0;
mutable mutex m_mutex;
};
template <class T>
ByteVector::ByteVector(const T& object, size_t min_size)
{
const auto size = max(min_size, sizeof(T));
m_ptr = Pointer(static_cast<value_type*>(malloc(size)), free);
memcpy(m_ptr.get(), &object, sizeof(T));
m_size = size;
}
template <class T>
T*
ByteVector::get() const
{
THROW_CHECK2(out_of_range, size(), sizeof(T), size() >= sizeof(T));
return reinterpret_cast<T*>(data());
}
ostream& operator<<(ostream &os, const ByteVector &bv);
}
#endif // _CRUCIBLE_BYTEVECTOR_H_


@@ -3,8 +3,8 @@
#include "crucible/lockset.h"
#include <algorithm>
#include <functional>
#include <list>
#include <map>
#include <mutex>
#include <tuple>
@@ -17,17 +17,26 @@ namespace crucible {
public:
using Key = tuple<Arguments...>;
using Func = function<Return(Arguments...)>;
using Time = unsigned;
using Value = pair<Time, Return>;
private:
Func m_fn;
Time m_ctr;
map<Key, Value> m_map;
LockSet<Key> m_lockset;
size_t m_max_size;
mutex m_mutex;
struct Value {
Key key;
Return ret;
};
using ListIter = typename list<Value>::iterator;
Func m_fn;
list<Value> m_list;
map<Key, ListIter> m_map;
LockSet<Key> m_lockset;
size_t m_max_size;
mutable mutex m_mutex;
void check_overflow();
void recent_use(ListIter vp);
void erase_item(ListIter vp);
void erase_key(const Key &k);
Return insert_item(Func fn, Arguments... args);
public:
LRUCache(Func f = Func(), size_t max_size = 100);
@@ -37,43 +46,126 @@ namespace crucible {
Return operator()(Arguments... args);
Return refresh(Arguments... args);
void expire(Arguments... args);
void prune(function<bool(const Return &)> predicate);
void insert(const Return &r, Arguments... args);
void clear();
size_t size() const;
};
template <class Return, class... Arguments>
LRUCache<Return, Arguments...>::LRUCache(Func f, size_t max_size) :
m_fn(f),
m_ctr(0),
m_max_size(max_size)
{
}
template <class Return, class... Arguments>
Return
LRUCache<Return, Arguments...>::insert_item(Func fn, Arguments... args)
{
Key k(args...);
// Do we have it cached?
unique_lock<mutex> lock(m_mutex);
auto found = m_map.find(k);
if (found == m_map.end()) {
// No, release cache lock and acquire key lock
lock.unlock();
auto key_lock = m_lockset.make_lock(k);
// Did item appear in cache while we were waiting for key?
lock.lock();
found = m_map.find(k);
if (found == m_map.end()) {
// No, we now hold key and cache locks, but item not in cache.
// Release cache lock and call the function
lock.unlock();
// Create new value
Value v {
.key = k,
.ret = fn(args...),
};
// Reacquire cache lock
lock.lock();
// Make room
check_overflow();
// Insert return value at back of LRU list (hot end)
auto new_item = m_list.insert(m_list.end(), v);
// Insert return value in map
bool inserted = false;
tie(found, inserted) = m_map.insert(make_pair(v.key, new_item));
// We (should be) holding a lock on this key so we are the ones to insert it
THROW_CHECK0(runtime_error, inserted);
}
// Item should be in cache now
THROW_CHECK0(runtime_error, found != m_map.end());
} else {
// Move to end of LRU
recent_use(found->second);
}
// Return cached object
return found->second->ret;
}
template <class Return, class... Arguments>
void
LRUCache<Return, Arguments...>::erase_item(ListIter vp)
{
if (vp != m_list.end()) {
m_map.erase(vp->key);
m_list.erase(vp);
}
}
template <class Return, class... Arguments>
void
LRUCache<Return, Arguments...>::erase_key(const Key &k)
{
auto map_item = m_map.find(k);
if (map_item != m_map.end()) {
auto list_item = map_item->second;
m_map.erase(map_item);
m_list.erase(list_item);
}
}
template <class Return, class... Arguments>
void
LRUCache<Return, Arguments...>::check_overflow()
{
if (m_map.size() <= m_max_size) return;
vector<pair<Key, Time>> map_contents;
map_contents.reserve(m_map.size());
for (auto i : m_map) {
map_contents.push_back(make_pair(i.first, i.second.first));
}
sort(map_contents.begin(), map_contents.end(), [](const pair<Key, Time> &a, const pair<Key, Time> &b) {
return a.second < b.second;
});
for (size_t i = 0; i < map_contents.size() / 2; ++i) {
m_map.erase(map_contents[i].first);
// Erase items at front of LRU list (cold end) until max size reached or list empty
while (m_map.size() >= m_max_size && !m_list.empty()) {
erase_item(m_list.begin());
}
}
template <class Return, class... Arguments>
void
LRUCache<Return, Arguments...>::recent_use(ListIter vp)
{
// Splice existing items at back of LRU list (hot end)
auto next_vp = vp;
++next_vp;
m_list.splice(m_list.end(), m_list, vp, next_vp);
}
template <class Return, class... Arguments>
void
LRUCache<Return, Arguments...>::max_size(size_t new_max_size)
{
unique_lock<mutex> lock(m_mutex);
m_max_size = new_max_size;
// FIXME: this really reduces the cache size to new_max_size - 1
// because every other time we call this method, it is immediately
// followed by insert.
check_overflow();
}
@@ -89,80 +181,37 @@ namespace crucible {
void
LRUCache<Return, Arguments...>::clear()
{
// Move the map and list onto the stack, then destroy it after we've released the lock
// so that we don't block other threads if the list's destructors are expensive
decltype(m_list) new_list;
decltype(m_map) new_map;
unique_lock<mutex> lock(m_mutex);
m_map.clear();
m_list.swap(new_list);
m_map.swap(new_map);
lock.unlock();
}
template <class Return, class... Arguments>
void
LRUCache<Return, Arguments...>::prune(function<bool(const Return &)> pred)
size_t
LRUCache<Return, Arguments...>::size() const
{
unique_lock<mutex> lock(m_mutex);
for (auto it = m_map.begin(); it != m_map.end(); ) {
auto next_it = ++it;
if (pred(it.second.second)) {
m_map.erase(it);
}
it = next_it;
}
return m_map.size();
}
template<class Return, class... Arguments>
Return
LRUCache<Return, Arguments...>::operator()(Arguments... args)
{
Key k(args...);
bool inserted = false;
// Do we have it cached?
unique_lock<mutex> lock(m_mutex);
auto found = m_map.find(k);
if (found == m_map.end()) {
// No, release cache lock and acquire key lock
lock.unlock();
typename LockSet<Key>::Lock key_lock(m_lockset, k);
// Did item appear in cache while we were waiting for key?
lock.lock();
found = m_map.find(k);
if (found == m_map.end()) {
// No, we hold key and cache locks, but item not in cache.
// Release cache lock and call function
auto ctr_copy = m_ctr++;
lock.unlock();
Value v(ctr_copy, m_fn(args...));
// Reacquire cache lock and insert return value
lock.lock();
tie(found, inserted) = m_map.insert(make_pair(k, v));
// We hold a lock on this key so we are the ones to insert it
THROW_CHECK0(runtime_error, inserted);
// Release key lock and clean out overflow
key_lock.unlock();
check_overflow();
}
}
// Item should be in cache now
THROW_CHECK0(runtime_error, found != m_map.end());
// We are using this object so update the timestamp
if (!inserted) {
found->second.first = m_ctr++;
}
return found->second.second;
return insert_item(m_fn, args...);
}
template<class Return, class... Arguments>
void
LRUCache<Return, Arguments...>::expire(Arguments... args)
{
Key k(args...);
unique_lock<mutex> lock(m_mutex);
m_map.erase(k);
erase_key(Key(args...));
}
template<class Return, class... Arguments>
@@ -177,44 +226,7 @@ namespace crucible {
void
LRUCache<Return, Arguments...>::insert(const Return &r, Arguments... args)
{
Key k(args...);
bool inserted = false;
// Do we have it cached?
unique_lock<mutex> lock(m_mutex);
auto found = m_map.find(k);
if (found == m_map.end()) {
// No, release cache lock and acquire key lock
lock.unlock();
typename LockSet<Key>::Lock key_lock(m_lockset, k);
// Did item appear in cache while we were waiting for key?
lock.lock();
found = m_map.find(k);
if (found == m_map.end()) {
// No, we hold key and cache locks, but item not in cache.
// Release cache lock and insert the provided return value
auto ctr_copy = m_ctr++;
Value v(ctr_copy, r);
tie(found, inserted) = m_map.insert(make_pair(k, v));
// We hold a lock on this key so we are the ones to insert it
THROW_CHECK0(runtime_error, inserted);
// Release key lock and clean out overflow
key_lock.unlock();
check_overflow();
}
}
// Item should be in cache now
THROW_CHECK0(runtime_error, found != m_map.end());
// We are using this object so update the timestamp
if (!inserted) {
found->second.first = m_ctr++;
}
insert_item([&](Arguments...) -> Return { return r; }, args...);
}
}
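
The list-plus-map layout this change moves to can be illustrated with a much smaller cache. The sketch below is single-threaded and leaves out the key locks and the cached function call; `SimpleLRU`, `put` and `get` are hypothetical names chosen for the example, not part of crucible.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <utility>

template <class Key, class Value>
class SimpleLRU {
    struct Item {
        Key key;
        Value value;
    };
    using ListIter = typename std::list<Item>::iterator;

    std::list<Item> m_list;          // front = cold end, back = hot end
    std::map<Key, ListIter> m_map;   // key -> position in m_list
    size_t m_max_size;

    // Move an existing item to the hot end; list iterators stay valid across splice.
    void recent_use(ListIter it) { m_list.splice(m_list.end(), m_list, it); }

public:
    explicit SimpleLRU(size_t max_size) : m_max_size(max_size)
    {
        assert(max_size > 0);
    }

    void put(const Key &k, const Value &v)
    {
        auto found = m_map.find(k);
        if (found != m_map.end()) {
            found->second->value = v;
            recent_use(found->second);
            return;
        }
        // Evict from the cold end until there is room for one more item.
        while (!m_list.empty() && m_map.size() >= m_max_size) {
            m_map.erase(m_list.front().key);
            m_list.pop_front();
        }
        m_map.emplace(k, m_list.insert(m_list.end(), Item{k, v}));
    }

    // Returns nullptr on a miss; a hit refreshes the item's position.
    const Value *get(const Key &k)
    {
        auto found = m_map.find(k);
        if (found == m_map.end()) return nullptr;
        recent_use(found->second);
        return &found->second->value;
    }

    size_t size() const { return m_map.size(); }
};
```

Lookup and eviction no longer need to sort the whole map by timestamp: a hit is one map lookup plus a constant-time splice, and eviction pops the front of the list.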


@@ -8,6 +8,8 @@
#include <string>
#include <typeinfo>
#include <syslog.h>
/** \brief Chatter wraps a std::ostream reference with a destructor that
writes a newline, and inserts timestamp, pid, and tid prefixes on output.
@@ -33,18 +35,22 @@ namespace crucible {
using namespace std;
class Chatter {
int m_loglevel;
string m_name;
ostream &m_os;
ostringstream m_oss;
public:
Chatter(string name, ostream &os = cerr);
Chatter(int loglevel, string name, ostream &os = cerr);
Chatter(Chatter &&c);
ostream &get_os() { return m_oss; }
template <class T> Chatter &operator<<(const T& arg);
~Chatter();
static void enable_timestamp(bool prefix_timestamp);
static void enable_level(bool prefix_level);
};
template <class Argument>
@@ -86,16 +92,6 @@ namespace crucible {
}
};
template <>
struct ChatterTraits<ostream &> {
Chatter &
operator()(Chatter &c, ostream & arg)
{
c.get_os() << arg;
return c;
}
};
class ChatterBox {
string m_file;
int m_line;
@@ -111,7 +107,7 @@ namespace crucible {
template <class T> Chatter operator<<(const T &t)
{
Chatter c(m_pretty_function, m_os);
Chatter c(LOG_NOTICE, m_pretty_function, m_os);
c << t;
return c;
}

include/crucible/city.h (new file, 113 lines)

@@ -0,0 +1,113 @@
// Copyright (c) 2011 Google, Inc.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//
// CityHash, by Geoff Pike and Jyrki Alakuijala
//
// http://code.google.com/p/cityhash/
//
// This file provides a few functions for hashing strings. All of them are
// high-quality functions in the sense that they pass standard tests such
// as Austin Appleby's SMHasher. They are also fast.
//
// For 64-bit x86 code, on short strings, we don't know of anything faster than
// CityHash64 that is of comparable quality. We believe our nearest competitor
// is Murmur3. For 64-bit x86 code, CityHash64 is an excellent choice for hash
// tables and most other hashing (excluding cryptography).
//
// For 64-bit x86 code, on long strings, the picture is more complicated.
// On many recent Intel CPUs, such as Nehalem, Westmere, Sandy Bridge, etc.,
// CityHashCrc128 appears to be faster than all competitors of comparable
// quality. CityHash128 is also good but not quite as fast. We believe our
// nearest competitor is Bob Jenkins' Spooky. We don't have great data for
// other 64-bit CPUs, but for long strings we know that Spooky is slightly
// faster than CityHash on some relatively recent AMD x86-64 CPUs, for example.
// Note that CityHashCrc128 is declared in citycrc.h [which has been removed
// for bees].
//
// For 32-bit x86 code, we don't know of anything faster than CityHash32 that
// is of comparable quality. We believe our nearest competitor is Murmur3A.
// (On 64-bit CPUs, it is typically faster to use the other CityHash variants.)
//
// Functions in the CityHash family are not suitable for cryptography.
//
// Please see CityHash's README file for more details on our performance
// measurements and so on.
//
// WARNING: This code has been only lightly tested on big-endian platforms!
// It is known to work well on little-endian platforms that have a small penalty
// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs.
// It should work on all 32-bit and 64-bit platforms that allow unaligned reads;
// bug reports are welcome.
//
// By the way, for some hash functions, given strings a and b, the hash
// of a+b is easily derived from the hashes of a and b. This property
// doesn't hold for any hash functions in this file.
#ifndef CITY_HASH_H_
#define CITY_HASH_H_
#include <stdlib.h> // for size_t.
#include <stdint.h>
#include <utility>
typedef uint8_t uint8;
typedef uint32_t uint32;
typedef uint64_t uint64;
typedef std::pair<uint64, uint64> uint128;
inline uint64 Uint128Low64(const uint128& x) { return x.first; }
inline uint64 Uint128High64(const uint128& x) { return x.second; }
// Hash function for a byte array.
uint64 CityHash64(const char *buf, size_t len);
// Hash function for a byte array. For convenience, a 64-bit seed is also
// hashed into the result.
uint64 CityHash64WithSeed(const char *buf, size_t len, uint64 seed);
// Hash function for a byte array. For convenience, two seeds are also
// hashed into the result.
uint64 CityHash64WithSeeds(const char *buf, size_t len,
uint64 seed0, uint64 seed1);
// Hash function for a byte array.
uint128 CityHash128(const char *s, size_t len);
// Hash function for a byte array. For convenience, a 128-bit seed is also
// hashed into the result.
uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed);
// Hash function for a byte array. Most useful in 32-bit binaries.
uint32 CityHash32(const char *buf, size_t len);
// Hash 128 input bits down to 64 bits of output.
// This is intended to be a reasonably good hash function.
inline uint64 Hash128to64(const uint128& x) {
// Murmur-inspired hashing.
const uint64 kMul = 0x9ddfea08eb382d69ULL;
uint64 a = (Uint128Low64(x) ^ Uint128High64(x)) * kMul;
a ^= (a >> 47);
uint64 b = (Uint128High64(x) ^ a) * kMul;
b ^= (b >> 47);
b *= kMul;
return b;
}
#endif // CITY_HASH_H_


@@ -0,0 +1,18 @@
#ifndef CRUCIBLE_CLEANUP_H
#define CRUCIBLE_CLEANUP_H
#include <functional>
namespace crucible {
using namespace std;
class Cleanup {
function<void()> m_cleaner;
public:
Cleanup(function<void()> func);
~Cleanup();
};
}
#endif // CRUCIBLE_CLEANUP_H
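
The header above only declares Cleanup, so the behaviour sketched here is an assumption: a scope guard that runs the stored function when the object is destroyed. `ScopeCleanup` and `count_cleanups` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Runs the stored function when the guard goes out of scope.
class ScopeCleanup {
    std::function<void()> m_cleaner;
public:
    explicit ScopeCleanup(std::function<void()> func) : m_cleaner(std::move(func)) {}
    ScopeCleanup(const ScopeCleanup &) = delete;
    ScopeCleanup &operator=(const ScopeCleanup &) = delete;
    ~ScopeCleanup()
    {
        // An empty std::function would throw if called, so check first.
        if (m_cleaner) m_cleaner();
    }
};

// Returns how many times the cleanup ran for one guarded scope.
inline int count_cleanups()
{
    int n = 0;
    {
        ScopeCleanup guard([&]() { ++n; });
        assert(n == 0);   // nothing has run yet inside the scope
    }
    return n;
}
```

The guard is useful when a function has several exit paths, including exceptions, and the same teardown must happen on all of them.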


@@ -3,11 +3,11 @@
#include <cstdint>
#include <cstdlib>
#include <cstring>
namespace crucible {
namespace Digest {
namespace CRC {
uint64_t crc64(const char *s);
uint64_t crc64(const void *p, size_t len);
};
};

include/crucible/endian.h (new file, 58 lines)

@@ -0,0 +1,58 @@
#ifndef CRUCIBLE_ENDIAN_H
#define CRUCIBLE_ENDIAN_H
#include <cstdint>
#include <endian.h>
namespace crucible {
template<class T>
struct le_to_cpu_helper {
T operator()(const T v);
};
template<> struct le_to_cpu_helper<uint64_t> {
uint64_t operator()(const uint64_t v) { return le64toh(v); }
};
#if __SIZEOF_LONG__ == 8
// uint64_t is unsigned long on LP64 platforms
template<> struct le_to_cpu_helper<unsigned long long> {
unsigned long long operator()(const unsigned long long v) { return le64toh(v); }
};
#endif
template<> struct le_to_cpu_helper<uint32_t> {
uint32_t operator()(const uint32_t v) { return le32toh(v); }
};
template<> struct le_to_cpu_helper<uint16_t> {
uint16_t operator()(const uint16_t v) { return le16toh(v); }
};
template<> struct le_to_cpu_helper<uint8_t> {
uint8_t operator()(const uint8_t v) { return v; }
};
template<class T>
T
le_to_cpu(const T v)
{
return le_to_cpu_helper<T>()(v);
}
template<class T>
T
get_unaligned(const void *const p)
{
struct not_aligned {
T v;
} __attribute__((packed));
const not_aligned *const nap = reinterpret_cast<const not_aligned*>(p);
return nap->v;
}
}
#endif // CRUCIBLE_ENDIAN_H
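
A short sketch of how helpers like these are typically combined: read a little-endian field at an arbitrary byte offset, then convert it to host order. Note the swap: this example uses memcpy for the unaligned read instead of the packed-struct technique in the header, because memcpy is portable across compilers. `read_unaligned` and `read_le32` are hypothetical names, and `le32toh` comes from glibc's <endian.h>, so a Linux-like platform is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <endian.h>   // le32toh (glibc)

// Copy sizeof(T) bytes out of a possibly misaligned address.
template <class T>
T read_unaligned(const void *p)
{
    T v;
    std::memcpy(&v, p, sizeof(T));
    return v;
}

// Read a 32-bit little-endian field at a byte offset and return it in host order.
inline uint32_t read_le32(const uint8_t *buf, size_t offset)
{
    return le32toh(read_unaligned<uint32_t>(buf + offset));
}
```

On-disk btrfs structures are little-endian and packed, so fields often sit at odd offsets inside a search result buffer; dereferencing a plain `uint32_t *` there is undefined behaviour on strict-alignment targets.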


@@ -81,31 +81,25 @@ namespace crucible {
// macro for throwing an error
#define THROW_ERROR(type, expr) do { \
std::ostringstream _te_oss; \
_te_oss << expr; \
_te_oss << expr << " at " << __FILE__ << ":" << __LINE__; \
throw type(_te_oss.str()); \
} while (0)
// macro for throwing a system_error with errno
#define THROW_ERRNO(expr) do { \
std::ostringstream _te_oss; \
_te_oss << expr; \
_te_oss << expr << " at " << __FILE__ << ":" << __LINE__; \
throw std::system_error(std::error_code(errno, std::system_category()), _te_oss.str()); \
} while (0)
// macro for throwing a system_error with some other variable
#define THROW_ERRNO_VALUE(value, expr) do { \
std::ostringstream _te_oss; \
_te_oss << expr; \
_te_oss << expr << " at " << __FILE__ << ":" << __LINE__; \
throw std::system_error(std::error_code((value), std::system_category()), _te_oss.str()); \
} while (0)
// macros for checking a constraint
#define CHECK_CONSTRAINT(value, expr) do { \
if (!(expr)) { \
THROW_ERROR(out_of_range, #value << " = " << value << " failed constraint check (" << #expr << ")"); \
} \
} while(0)
#define THROW_CHECK0(type, expr) do { \
if (!(expr)) { \
THROW_ERROR(type, "failed constraint check (" << #expr << ")"); \
@@ -132,6 +126,13 @@ namespace crucible {
} \
} while(0)
#define THROW_CHECK4(type, value1, value2, value3, value4, expr) do { \
if (!(expr)) { \
THROW_ERROR(type, #value1 << " = " << (value1) << ", " #value2 << " = " << (value2) << ", " #value3 << " = " << (value3) << ", " #value4 << " = " << (value4) \
<< " failed constraint check (" << #expr << ")"); \
} \
} while(0)
#define THROW_CHECK_BIN_OP(type, value1, op, value2) do { \
if (!((value1) op (value2))) { \
THROW_ERROR(type, "failed constraint check " << #value1 << " (" << (value1) << ") " << #op << " " << #value2 << " (" << (value2) << ")"); \
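
The effect of appending the file and line to every thrown message can be seen in a reduced pair of macros. These are illustrative copies under different names (`DEMO_THROW_ERROR`, `DEMO_THROW_CHECK1`, `check_positive` are all hypothetical), not the crucible macros themselves.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Build the message with stream syntax, then append the throw site.
#define DEMO_THROW_ERROR(type, expr) do { \
    std::ostringstream demo_oss; \
    demo_oss << expr << " at " << __FILE__ << ":" << __LINE__; \
    throw type(demo_oss.str()); \
} while (0)

// Report one value together with the failed expression.
#define DEMO_THROW_CHECK1(type, value, expr) do { \
    if (!(expr)) { \
        DEMO_THROW_ERROR(type, #value << " = " << (value) << " failed constraint check (" << #expr << ")"); \
    } \
} while (0)

// Returns "ok" when n is positive, otherwise the exception text.
inline std::string check_positive(int n)
{
    try {
        DEMO_THROW_CHECK1(std::out_of_range, n, n > 0);
        return "ok";
    } catch (const std::out_of_range &e) {
        return e.what();
    }
}
```

With the location in the message, a log line from a failed check points at the exact source line without needing a debugger or a core dump.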


@@ -1,28 +0,0 @@
#ifndef CRUCIBLE_EXECPIPE_H
#define CRUCIBLE_EXECPIPE_H
#include "crucible/fd.h"
#include <functional>
#include <limits>
#include <string>
namespace crucible {
using namespace std;
void redirect_stdin(const Fd &child_fd);
void redirect_stdin_stdout(const Fd &child_fd);
void redirect_stdin_stdout_stderr(const Fd &child_fd);
void redirect_stdout(const Fd &child_fd);
void redirect_stdout_stderr(const Fd &child_fd);
// Open a pipe (actually socketpair) to child process, then execute code in that process.
// e.g. popen([] () { system("echo Hello, World!"); });
// Forked process will exit when function returns.
Fd popen(function<int()> f, function<void(const Fd &child_fd)> import_fd_fn = redirect_stdin_stdout);
// Read all the data from fd into a string
string read_all(Fd fd, size_t max_bytes = numeric_limits<size_t>::max(), size_t chunk_bytes = 4096);
};
#endif // CRUCIBLE_EXECPIPE_H


@@ -8,15 +8,15 @@ namespace crucible {
// FIXME: ExtentCursor is probably a better name
struct Extent {
off_t m_begin;
off_t m_end;
uint64_t m_physical;
uint64_t m_flags;
off_t m_begin = 0;
off_t m_end = 0;
uint64_t m_physical = 0;
uint64_t m_flags = 0;
// Btrfs extent reference details
off_t m_physical_len;
off_t m_logical_len;
off_t m_offset;
off_t m_physical_len = 0;
off_t m_logical_len = 0;
off_t m_offset = 0;
// fiemap flags are uint32_t, so bits 32..63 are OK for us
@@ -38,11 +38,10 @@ namespace crucible {
off_t physical_len() const { return m_physical_len; }
off_t logical_len() const { return m_logical_len; }
off_t offset() const { return m_offset; }
bool compressed() const;
uint64_t bytenr() const;
bool operator==(const Extent &that) const;
bool operator!=(const Extent &that) const { return !(*this == that); }
Extent();
Extent(const Extent &e) = default;
};
class ExtentWalker {
@@ -56,10 +55,6 @@ namespace crucible {
virtual Vec get_extent_map(off_t pos);
static const unsigned sc_extent_fetch_max = 64;
static const unsigned sc_extent_fetch_min = 4;
static const off_t sc_step_size = 0x1000 * (sc_extent_fetch_max / 2);
private:
Vec m_extents;
Itr m_current;
@@ -67,6 +62,10 @@ namespace crucible {
Itr find_in_cache(off_t pos);
void run_fiemap(off_t pos);
#ifdef EXTENTWALKER_DEBUG
ostringstream m_log;
#endif
public:
ExtentWalker(Fd fd = Fd());
ExtentWalker(Fd fd, off_t initial_pos);


@@ -1,7 +1,8 @@
#ifndef CRUCIBLE_FD_H
#define CRUCIBLE_FD_H
#include "crucible/resource.h"
#include "crucible/bytevector.h"
#include "crucible/namedptr.h"
#include <cstring>
@@ -13,6 +14,10 @@
#include <sys/stat.h>
#include <fcntl.h>
// ioctl
#include <sys/ioctl.h>
#include <linux/fs.h>
// socket
#include <sys/socket.h>
@@ -22,76 +27,91 @@
namespace crucible {
using namespace std;
// IOHandle is a file descriptor owner object. It closes them when destroyed.
// Most of the functions here don't use it because these functions don't own FDs.
// All good names for such objects are taken.
/// File descriptor owner object. It closes them when destroyed.
/// Most of the functions here don't use it because these functions don't own FDs.
/// All good names for such objects are taken.
class IOHandle {
IOHandle(const IOHandle &) = delete;
IOHandle(IOHandle &&) = delete;
IOHandle& operator=(IOHandle &&) = delete;
IOHandle& operator=(const IOHandle &) = delete;
protected:
int m_fd;
IOHandle& operator=(int that) { m_fd = that; return *this; }
void close();
public:
virtual ~IOHandle();
IOHandle(int fd);
IOHandle();
void close();
int get_fd() const { return m_fd; }
int release_fd();
IOHandle(int fd = -1);
int get_fd() const;
};
template <>
struct ResourceTraits<int, IOHandle> {
int get_key(const IOHandle &res) const { return res.get_fd(); }
shared_ptr<IOHandle> make_resource(int fd) const { return make_shared<IOHandle>(fd); }
bool is_null_key(const int &key) const { return key < 0; }
int get_null_key() const { return -1; }
};
/// Copyable file descriptor.
class Fd {
static NamedPtr<IOHandle, int> s_named_ptr;
shared_ptr<IOHandle> m_handle;
public:
using resource_type = IOHandle;
Fd();
Fd(int fd);
Fd &operator=(int fd);
Fd &operator=(const shared_ptr<IOHandle> &);
operator int() const;
bool operator!() const;
shared_ptr<IOHandle> operator->() const;
};
typedef ResourceHandle<int, IOHandle> Fd;
void set_relative_path(string path);
string relative_path();
// Functions named "foo_or_die" throw exceptions on failure.
// Attempt to open the file with the given mode
/// Attempt to open the file with the given mode, throw exception on failure.
int open_or_die(const string &file, int flags = O_RDONLY, mode_t mode = 0777);
/// Attempt to open the file with the given mode, throw exception on failure.
int openat_or_die(int dir_fd, const string &file, int flags = O_RDONLY, mode_t mode = 0777);
// Decode open parameters
/// Decode open flags
string o_flags_ntoa(int flags);
/// Decode open mode
string o_mode_ntoa(mode_t mode);
// mmap with its one weird error case
/// mmap with its one weird error case
void *mmap_or_die(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
// Decode mmap parameters
/// Decode mmap prot
string mmap_prot_ntoa(int prot);
/// Decode mmap flags
string mmap_flags_ntoa(int flags);
// Unlink, rename
void unlink_or_die(const string &file);
/// Rename, throw exception on failure.
void rename_or_die(const string &from, const string &to);
/// Rename, throw exception on failure.
void renameat_or_die(int fromfd, const string &frompath, int tofd, const string &topath);
/// Truncate, throw exception on failure.
void ftruncate_or_die(int fd, off_t size);
// Read or write structs:
// There is a template specialization to read or write strings
// Three-arg version of read_or_die/write_or_die throws an error on incomplete read/writes
// Four-arg version returns number of bytes read/written through reference arg
/// Attempt read by pointer and length, throw exception on IO error or short read.
void read_or_die(int fd, void *buf, size_t size);
/// Attempt read of a POD struct, throw exception on IO error or short read.
template <class T> void read_or_die(int fd, T& buf)
{
return read_or_die(fd, static_cast<void *>(&buf), sizeof(buf));
}
/// Attempt read by pointer and length, throw exception on IO error but not short read.
void read_partial_or_die(int fd, void *buf, size_t size_wanted, size_t &size_read);
/// Attempt read of a POD struct, throw exception on IO error but not short read.
template <class T> void read_partial_or_die(int fd, T& buf, size_t &size_read)
{
return read_partial_or_die(fd, static_cast<void *>(&buf), sizeof(buf), size_read);
}
/// Attempt read at position by pointer and length, throw exception on IO error but not short read.
void pread_or_die(int fd, void *buf, size_t size, off_t offset);
/// Attempt read at position of a POD struct, throw exception on IO error but not short read.
template <class T> void pread_or_die(int fd, T& buf, off_t offset)
{
return pread_or_die(fd, static_cast<void *>(&buf), sizeof(buf), offset);
@@ -118,17 +138,23 @@ namespace crucible {
// Specialization for strings which reads/writes the string content, not the struct string
template<> void write_or_die<string>(int fd, const string& str);
template<> void pread_or_die<string>(int fd, string& str, off_t offset);
template<> void pread_or_die<vector<char>>(int fd, vector<char>& str, off_t offset);
template<> void pread_or_die<vector<uint8_t>>(int fd, vector<uint8_t>& str, off_t offset);
template<> void pwrite_or_die<string>(int fd, const string& str, off_t offset);
template<> void pread_or_die<ByteVector>(int fd, ByteVector& str, off_t offset);
template<> void pwrite_or_die<ByteVector>(int fd, const ByteVector& str, off_t offset);
// Deprecated
template<> void pread_or_die<vector<uint8_t>>(int fd, vector<uint8_t>& str, off_t offset) = delete;
template<> void pwrite_or_die<vector<uint8_t>>(int fd, const vector<uint8_t>& str, off_t offset) = delete;
template<> void pread_or_die<vector<char>>(int fd, vector<char>& str, off_t offset) = delete;
template<> void pwrite_or_die<vector<char>>(int fd, const vector<char>& str, off_t offset) = delete;
// A different approach to reading a simple string
/// Read a simple string.
string read_string(int fd, size_t size);
// A lot of Unix API wants you to initialize a struct and call
// one function to fill it, another function to throw it away,
// and has some unknown third thing you have to do when there's
// an error. That's also a C++ object with an exception-throwing
// constructor.
/// A lot of Unix API wants you to initialize a struct and call
/// one function to fill it, another function to throw it away,
/// and has some unknown third thing you have to do when there's
/// an error. That's also a C++ object with an exception-throwing
/// constructor.
struct Stat : public stat {
Stat();
Stat(int f);
@@ -137,19 +163,22 @@ namespace crucible {
Stat &lstat(const string &filename);
};
int ioctl_iflags_get(int fd);
void ioctl_iflags_set(int fd, int attr);
string st_mode_ntoa(mode_t mode);
// Because it's not trivial to do correctly
/// Because it's not trivial to do correctly
string readlink_or_die(const string &path);
// Determine the name of a FD by readlink through /proc/self/fd/
/// Determine the name of a FD by readlink through /proc/self/fd/
string name_fd(int fd);
// Returns Fd objects because it does own them.
/// Returns Fd objects because it does own them.
pair<Fd, Fd> socketpair_or_die(int domain = AF_UNIX, int type = SOCK_STREAM, int protocol = 0);
// like unique_lock but for flock instead of mutexes...and not trying
// to hide the many and subtle differences between those two things *at all*.
/// like unique_lock but for flock instead of mutexes...and not trying
/// to hide the many and subtle differences between those two things *at all*.
class Flock {
int m_fd;
bool m_locked;
@@ -170,7 +199,7 @@ namespace crucible {
int fd();
};
// Doesn't use Fd objects because it's usually just used to replace stdin/stdout/stderr.
/// Doesn't use Fd objects because it's usually just used to replace stdin/stdout/stderr.
void dup2_or_die(int fd_in, int fd_out);
}
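
The "_or_die" convention documented in this header (throw an exception on failure instead of returning an error code) might look like the following for open. This is a sketch under assumptions: `demo_open_or_die` is a hypothetical name, and the real function's message format and exception details are not shown in this diff.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <system_error>
#include <unistd.h>

// Open a file, or throw std::system_error carrying errno and the path.
inline int demo_open_or_die(const std::string &file, int flags = O_RDONLY, mode_t mode = 0777)
{
    const int fd = ::open(file.c_str(), flags, mode);
    if (fd < 0) {
        // Capture errno immediately, before any other call can overwrite it.
        const int saved_errno = errno;
        throw std::system_error(saved_errno, std::system_category(), "open: " + file);
    }
    return fd;
}
```

The function returns a plain int because it does not own the descriptor; the caller decides whether to wrap it in an owning object or close it directly.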


@@ -1,6 +1,8 @@
#ifndef CRUCIBLE_FS_H
#define CRUCIBLE_FS_H
#include "crucible/bytevector.h"
#include "crucible/endian.h"
#include "crucible/error.h"
// Terribly Linux-specific FS-wrangling functions
@@ -13,6 +15,7 @@
#include <cstdint>
#include <iosfwd>
#include <set>
#include <vector>
#include <fcntl.h>
@@ -24,23 +27,16 @@ namespace crucible {
// wrapper around fallocate(...FALLOC_FL_PUNCH_HOLE...)
void punch_hole(int fd, off_t offset, off_t len);
struct BtrfsExtentInfo : public btrfs_ioctl_same_extent_info {
BtrfsExtentInfo(int dst_fd, off_t dst_offset);
};
struct BtrfsExtentSame : public btrfs_ioctl_same_args {
struct BtrfsExtentSame {
virtual ~BtrfsExtentSame();
BtrfsExtentSame(int src_fd, off_t src_offset, off_t src_length);
void add(int fd, off_t offset);
void add(int fd, uint64_t offset);
virtual void do_ioctl();
uint64_t m_logical_offset = 0;
uint64_t m_length = 0;
int m_fd;
vector<BtrfsExtentInfo> m_info;
};
struct BtrfsExtentSameByClone : public BtrfsExtentSame {
using BtrfsExtentSame::BtrfsExtentSame;
void do_ioctl() override;
vector<btrfs_ioctl_same_extent_info> m_info;
};
ostream & operator<<(ostream &os, const btrfs_ioctl_same_extent_info *info);
@@ -55,26 +51,51 @@ namespace crucible {
ostream & operator<<(ostream &os, const BtrfsInodeOffsetRoot &p);
struct BtrfsDataContainer : public btrfs_data_container {
struct BtrfsDataContainer {
BtrfsDataContainer(size_t size = 64 * 1024);
void *prepare();
void *prepare(size_t size);
size_t get_size() const;
decltype(bytes_left) get_bytes_left() const;
decltype(bytes_missing) get_bytes_missing() const;
decltype(elem_cnt) get_elem_cnt() const;
decltype(elem_missed) get_elem_missed() const;
decltype(btrfs_data_container::bytes_left) get_bytes_left() const;
decltype(btrfs_data_container::bytes_missing) get_bytes_missing() const;
decltype(btrfs_data_container::elem_cnt) get_elem_cnt() const;
decltype(btrfs_data_container::elem_missed) get_elem_missed() const;
vector<char> m_data;
ByteVector m_data;
};
struct BtrfsIoctlLogicalInoArgs : public btrfs_ioctl_logical_ino_args {
BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 64 * 1024);
virtual void do_ioctl(int fd);
virtual bool do_ioctl_nothrow(int fd);
struct BtrfsIoctlLogicalInoArgs {
BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 16 * 1024 * 1024);
uint64_t get_flags() const;
void set_flags(uint64_t new_flags);
void set_logical(uint64_t new_logical);
void set_size(uint64_t new_size);
void do_ioctl(int fd);
bool do_ioctl_nothrow(int fd);
struct BtrfsInodeOffsetRootSpan {
using iterator = BtrfsInodeOffsetRoot*;
using const_iterator = const BtrfsInodeOffsetRoot*;
size_t size() const;
iterator begin() const;
iterator end() const;
const_iterator cbegin() const;
const_iterator cend() const;
iterator data() const;
void clear();
private:
iterator m_begin = nullptr;
iterator m_end = nullptr;
friend struct BtrfsIoctlLogicalInoArgs;
} m_iors;
private:
size_t m_container_size;
BtrfsDataContainer m_container;
vector<BtrfsInodeOffsetRoot> m_iors;
uint64_t m_logical;
uint64_t m_flags = 0;
friend ostream & operator<<(ostream &os, const BtrfsIoctlLogicalInoArgs *p);
};
ostream & operator<<(ostream &os, const BtrfsIoctlLogicalInoArgs &p);
@@ -84,7 +105,7 @@ namespace crucible {
virtual void do_ioctl(int fd);
virtual bool do_ioctl_nothrow(int fd);
BtrfsDataContainer m_container;
size_t m_container_size;
vector<string> m_paths;
};
@@ -106,15 +127,6 @@ namespace crucible {
ostream & operator<<(ostream &os, const BtrfsIoctlDefragRangeArgs *p);
// in btrfs/ctree.h, but that's a nightmare to #include here
typedef enum {
BTRFS_COMPRESS_NONE = 0,
BTRFS_COMPRESS_ZLIB = 1,
BTRFS_COMPRESS_LZO = 2,
BTRFS_COMPRESS_TYPES = 2,
BTRFS_COMPRESS_LAST = 3,
} btrfs_compression_type;
struct FiemapExtent : public fiemap_extent {
FiemapExtent();
FiemapExtent(const fiemap_extent &that);
@@ -123,16 +135,26 @@ namespace crucible {
off_t end() const;
};
struct Fiemap : public fiemap {
struct Fiemap {
// because fiemap.h insists on giving FIEMAP_MAX_OFFSET
// a different type from the struct fiemap members
static const uint64_t s_fiemap_max_offset = FIEMAP_MAX_OFFSET;
// Get entire file
Fiemap(uint64_t start = 0, uint64_t length = FIEMAP_MAX_OFFSET);
Fiemap(uint64_t start = 0, uint64_t length = s_fiemap_max_offset);
void do_ioctl(int fd);
vector<FiemapExtent> m_extents;
uint64_t m_min_count = (4096 - sizeof(fiemap)) / sizeof(fiemap_extent);
uint64_t m_max_count = 16 * 1024 * 1024 / sizeof(fiemap_extent);
decltype(fiemap::fm_extent_count) m_min_count = (4096 - sizeof(fiemap)) / sizeof(fiemap_extent);
decltype(fiemap::fm_extent_count) m_max_count = 16 * 1024 * 1024 / sizeof(fiemap_extent);
uint64_t m_start;
uint64_t m_length;
// FIEMAP is slow and full of lies.
// This makes FIEMAP even slower, but reduces the lies a little.
decltype(fiemap::fm_flags) m_flags = FIEMAP_FLAG_SYNC;
friend ostream &operator<<(ostream &, const Fiemap &);
};
ostream & operator<<(ostream &os, const fiemap_extent *info);
@@ -148,79 +170,70 @@ namespace crucible {
struct BtrfsIoctlSearchHeader : public btrfs_ioctl_search_header {
BtrfsIoctlSearchHeader();
vector<char> m_data;
size_t set_data(const vector<char> &v, size_t offset);
ByteVector m_data;
size_t set_data(const ByteVector &v, size_t offset);
bool operator<(const BtrfsIoctlSearchHeader &that) const;
};
// Perf blames this function for a few percent overhead; move it here so it can be inline
inline bool BtrfsIoctlSearchHeader::operator<(const BtrfsIoctlSearchHeader &that) const
{
return tie(objectid, type, offset, len, transid) < tie(that.objectid, that.type, that.offset, that.len, that.transid);
}
ostream & operator<<(ostream &os, const btrfs_ioctl_search_header &hdr);
ostream & operator<<(ostream &os, const BtrfsIoctlSearchHeader &hdr);
struct BtrfsIoctlSearchKey : public btrfs_ioctl_search_key {
BtrfsIoctlSearchKey(size_t buf_size = 1024 * 1024);
virtual bool do_ioctl_nothrow(int fd);
virtual void do_ioctl(int fd);
BtrfsIoctlSearchKey(size_t buf_size = 1024);
bool do_ioctl_nothrow(int fd);
void do_ioctl(int fd);
// Copy objectid/type/offset so we move forward
void next_min(const BtrfsIoctlSearchHeader& ref);
// move forward to next object of a single type
void next_min(const BtrfsIoctlSearchHeader& ref, const uint8_t type);
size_t m_buf_size;
vector<BtrfsIoctlSearchHeader> m_result;
set<BtrfsIoctlSearchHeader> m_result;
static thread_local size_t s_calls;
static thread_local size_t s_loops;
static thread_local size_t s_loops_empty;
static thread_local shared_ptr<ostream> s_debug_ostream;
};
ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);
string btrfs_chunk_type_ntoa(uint64_t type);
string btrfs_search_type_ntoa(unsigned type);
string btrfs_search_objectid_ntoa(unsigned objectid);
string btrfs_search_objectid_ntoa(uint64_t objectid);
string btrfs_compress_type_ntoa(uint8_t type);
uint64_t btrfs_get_root_id(int fd);
uint64_t btrfs_get_root_transid(int fd);
template<class T>
template<class T, class V>
const T*
get_struct_ptr(vector<char> &v, size_t offset = 0)
get_struct_ptr(const V &v, size_t offset = 0)
{
// OK so sometimes btrfs overshoots a little
if (offset + sizeof(T) > v.size()) {
v.resize(offset + sizeof(T), 0);
}
THROW_CHECK2(invalid_argument, v.size(), offset + sizeof(T), offset + sizeof(T) <= v.size());
return reinterpret_cast<const T*>(v.data() + offset);
THROW_CHECK2(out_of_range, v.size(), offset + sizeof(T), offset + sizeof(T) <= v.size());
const uint8_t *const data_ptr = v.data();
return reinterpret_cast<const T*>(data_ptr + offset);
}
template<class A, class R>
R
call_btrfs_get(R (*func)(const A*), vector<char> &v, size_t offset = 0)
{
return func(get_struct_ptr<A>(v, offset));
}
template <class T> struct btrfs_get_le;
template<> struct btrfs_get_le<__le64> {
uint64_t operator()(const void *p) { return get_unaligned_le64(p); }
};
template<> struct btrfs_get_le<__le32> {
uint32_t operator()(const void *p) { return get_unaligned_le32(p); }
};
template<> struct btrfs_get_le<__le16> {
uint16_t operator()(const void *p) { return get_unaligned_le16(p); }
};
template<> struct btrfs_get_le<__le8> {
uint8_t operator()(const void *p) { return get_unaligned_le8(p); }
};
template<class S, class T>
template<class S, class T, class V>
T
btrfs_get_member(T S::* member, vector<char> &v, size_t offset = 0)
btrfs_get_member(T S::* member, V &v, size_t offset = 0)
{
const S *sp = reinterpret_cast<const S*>(NULL);
const T *spm = &(sp->*member);
auto member_offset = reinterpret_cast<const char *>(spm) - reinterpret_cast<const char *>(sp);
return btrfs_get_le<T>()(get_struct_ptr<S>(v, offset + member_offset));
const S *const sp = nullptr;
const T *const spm = &(sp->*member);
const auto member_offset = reinterpret_cast<const uint8_t *>(spm) - reinterpret_cast<const uint8_t *>(sp);
const void *struct_ptr = get_struct_ptr<T>(v, offset + member_offset);
const T unaligned_t = get_unaligned<T>(struct_ptr);
return le_to_cpu(unaligned_t);
}
struct Statvfs : public statvfs {
@@ -232,12 +245,14 @@ namespace crucible {
unsigned long available() const;
};
ostream &hexdump(ostream &os, const vector<char> &v);
struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args {
struct BtrfsIoctlFsInfoArgs : public btrfs_ioctl_fs_info_args_v3 {
BtrfsIoctlFsInfoArgs();
void do_ioctl(int fd);
string uuid() const;
bool do_ioctl_nothrow(int fd);
uint16_t csum_type() const;
uint16_t csum_size() const;
uint64_t generation() const;
vector<uint8_t> fsid() const;
};
ostream & operator<<(ostream &os, const BtrfsIoctlFsInfoArgs &a);
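The reworked get_struct_ptr and btrfs_get_member above bounds-check the buffer and then decode a little-endian field without assuming alignment. A minimal sketch of that idea follows, assuming a hypothetical helper named demo_get_le that is not part of crucible; it assembles the value byte by byte instead of using crucible's get_unaligned and le_to_cpu.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of crucible): bounds-checked read of a
// little-endian unsigned integer from any byte container that provides
// size() and operator[].
template <class T, class V>
T demo_get_le(const V &v, std::size_t offset = 0)
{
    static_assert(std::is_unsigned<T>::value, "T must be an unsigned integer type");
    // Reject reads that would run past the end of the buffer
    if (offset > v.size() || sizeof(T) > v.size() - offset) {
        throw std::out_of_range("demo_get_le: read past end of buffer");
    }
    // Assemble the value one byte at a time, so neither the alignment
    // of the buffer nor the host byte order matters
    T rv = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        const T byte = static_cast<T>(static_cast<std::uint8_t>(v[offset + i]));
        rv = static_cast<T>(rv | static_cast<T>(byte << (8 * i)));
    }
    return rv;
}
```

A caller would pass the struct offset plus the member offset, for example demo_get_le<std::uint64_t>(buf, item_offset + member_offset), which mirrors how btrfs_get_member computes offset + member_offset before reading.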

@@ -0,0 +1,38 @@
#ifndef CRUCIBLE_HEXDUMP_H
#define CRUCIBLE_HEXDUMP_H
#include "crucible/string.h"
#include <ostream>
namespace crucible {
using namespace std;
template <class V>
ostream &
hexdump(ostream &os, const V &v)
{
const auto v_size = v.size();
const uint8_t* const v_data = reinterpret_cast<const uint8_t*>(v.data());
os << "V { size = " << v_size << ", data:\n";
for (size_t i = 0; i < v_size; i += 8) {
string hex, ascii;
for (size_t j = i; j < i + 8; ++j) {
if (j < v_size) {
const uint8_t c = v_data[j];
char buf[8];
sprintf(buf, "%02x ", c);
hex += buf;
ascii += (c < 32 || c > 126) ? '.' : c;
} else {
hex += " ";
ascii += ' ';
}
}
os << astringprintf("\t%08zx %s %s\n", i, hex.c_str(), ascii.c_str());
}
return os << "}";
}
};
#endif // CRUCIBLE_HEXDUMP_H
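The templated hexdump above accepts any container with size() and data(). A self-contained sketch of the same row layout (8 bytes per row, hex column, then printable ASCII) might look like the following; demo_hexdump is a hypothetical name, it returns a string instead of writing to an ostream, and it uses snprintf in place of crucible's astringprintf.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical stand-in for crucible::hexdump: format any container
// with size() and data() as rows of 8 bytes, hex first, ASCII second.
template <class V>
std::string demo_hexdump(const V &v)
{
    const auto *const data = reinterpret_cast<const std::uint8_t *>(v.data());
    const std::size_t size = v.size();
    std::ostringstream os;
    for (std::size_t i = 0; i < size; i += 8) {
        std::string hex, ascii;
        for (std::size_t j = i; j < i + 8; ++j) {
            if (j < size) {
                const std::uint8_t c = data[j];
                char buf[4];    // two hex digits, one space, NUL
                std::snprintf(buf, sizeof(buf), "%02x ", static_cast<unsigned>(c));
                hex += buf;
                // Replace non-printable bytes with '.'
                ascii += (c < 32 || c > 126) ? '.' : static_cast<char>(c);
            } else {
                // Pad short final rows so the ASCII column lines up
                hex += "   ";
                ascii += ' ';
            }
        }
        char off[24];
        std::snprintf(off, sizeof(off), "%08zx", i);
        os << off << " " << hex << ascii << "\n";
    }
    return os.str();
}
```

Because the template only needs size() and data(), the same function works for std::string, std::vector<char>, or a ByteVector-like type.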

@@ -1,106 +0,0 @@
#ifndef CRUCIBLE_INTERP_H
#define CRUCIBLE_INTERP_H
#include "crucible/error.h"
#include <map>
#include <memory>
#include <string>
#include <vector>
namespace crucible {
using namespace std;
struct ArgList : public vector<string> {
ArgList(const char **argv);
// using vector<string>::vector ... doesn't work:
// error: std::vector<std::basic_string<char> >::vector names constructor
// Still doesn't work in 4.9 because it can't manage a conversion
ArgList(const vector<string> &&that);
};
struct ArgActor {
struct ArgActorBase {
virtual void predicate(void *obj, string arg);
};
template <class T>
struct ArgActorDerived {
function<void(T, string)> m_func;
ArgActorDerived(decltype(m_func) func) :
m_func(func)
{
}
void predicate(void *obj, string arg) override
{
T &op = *(reinterpret_cast<T*>(obj));
m_func(op, obj);
}
};
template <class T>
ArgActor(T, function<void(T, string)> func) :
m_actor(make_shared(ArgActorDerived<T>(func)))
{
}
ArgActor() = default;
void predicate(void *t, string arg)
{
if (m_actor) {
m_actor->predicate(t, arg);
} else {
THROW_ERROR(invalid_argument, "null m_actor for predicate arg '" << arg << "'");
}
}
private:
shared_ptr<ArgActorBase> m_actor;
};
struct ArgParser {
~ArgParser();
ArgParser();
void add_opt(string opt, ArgActor actor);
template <class T>
void
parse(T t, const ArgList &args)
{
void *vt = &t;
parse_backend(vt, args);
}
private:
void parse_backend(void *t, const ArgList &args);
map<string, ArgActor> m_string_opts;
};
struct Command {
virtual ~Command();
virtual int exec(const ArgList &args) = 0;
};
struct Proc : public Command {
int exec(const ArgList &args) override;
Proc(const function<int(const ArgList &)> &f);
private:
function<int(const ArgList &)> m_cmd;
};
struct Interp {
virtual ~Interp();
Interp(const map<string, shared_ptr<Command> > &cmdlist);
void add_command(const string &name, const shared_ptr<Command> &command);
int exec(const ArgList &args);
private:
Interp(const Interp &) = delete;
map<string, shared_ptr<Command> > m_commands;
};
};
#endif // CRUCIBLE_INTERP_H

@@ -1,14 +1,17 @@
#ifndef CRUCIBLE_LOCKSET_H
#define CRUCIBLE_LOCKSET_H
#include <crucible/error.h>
#include "crucible/error.h"
#include "crucible/process.h"
#include <cassert>
#include <condition_variable>
#include <iostream>
#include <limits>
#include <map>
#include <memory>
#include <mutex>
#include <set>
namespace crucible {
using namespace std;
@@ -17,14 +20,36 @@ namespace crucible {
class LockSet {
public:
using key_type = T;
using set_type = set<T>;
using set_type = map<T, pid_t>;
using key_type = typename set_type::key_type;
private:
set_type m_set;
mutex m_mutex;
condition_variable m_condvar;
size_t m_max_size = numeric_limits<size_t>::max();
bool full();
bool locked(const key_type &name);
class Lock {
LockSet &m_lockset;
key_type m_name;
bool m_locked;
Lock() = delete;
Lock(const Lock &) = delete;
Lock& operator=(const Lock &) = delete;
Lock(Lock &&that) = delete;
Lock& operator=(Lock &&that) = delete;
public:
~Lock();
Lock(LockSet &lockset, const key_type &name, bool start_locked = true);
void lock();
void unlock();
bool try_lock();
};
public:
~LockSet();
@@ -36,26 +61,21 @@ namespace crucible {
size_t size();
bool empty();
set_type copy();
void wait_unlock(double interval);
class Lock {
LockSet &m_lockset;
key_type m_name;
bool m_locked;
void max_size(size_t max);
class LockHandle {
shared_ptr<Lock> m_lock;
Lock() = delete;
Lock(const Lock &) = delete;
Lock& operator=(const Lock &) = delete;
public:
~Lock();
Lock(LockSet &lockset, const key_type &m_name, bool start_locked = true);
Lock(Lock &&that);
Lock& operator=(Lock &&that);
void lock();
void unlock();
bool try_lock();
LockHandle(LockSet &lockset, const key_type &name, bool start_locked = true) :
m_lock(make_shared<Lock>(lockset, name, start_locked)) {}
void lock() { m_lock->lock(); }
void unlock() { m_lock->unlock(); }
bool try_lock() { return m_lock->try_lock(); }
};
LockHandle make_lock(const key_type &name, bool start_locked = true);
};
template <class T>
@@ -68,15 +88,36 @@ namespace crucible {
assert(m_set.empty());
}
template <class T>
bool
LockSet<T>::full()
{
return m_set.size() >= m_max_size;
}
template <class T>
bool
LockSet<T>::locked(const key_type &name)
{
return m_set.count(name);
}
template <class T>
void
LockSet<T>::max_size(size_t s)
{
m_max_size = s;
}
template <class T>
void
LockSet<T>::lock(const key_type &name)
{
unique_lock<mutex> lock(m_mutex);
while (m_set.count(name)) {
while (full() || locked(name)) {
m_condvar.wait(lock);
}
auto rv = m_set.insert(name);
auto rv = m_set.insert(make_pair(name, gettid()));
THROW_CHECK0(runtime_error, rv.second);
}
@@ -85,10 +126,10 @@ namespace crucible {
LockSet<T>::try_lock(const key_type &name)
{
unique_lock<mutex> lock(m_mutex);
if (m_set.count(name)) {
if (full() || locked(name)) {
return false;
}
auto rv = m_set.insert(name);
auto rv = m_set.insert(make_pair(name, gettid()));
THROW_CHECK1(runtime_error, name, rv.second);
return true;
}
@@ -98,20 +139,11 @@ namespace crucible {
LockSet<T>::unlock(const key_type &name)
{
unique_lock<mutex> lock(m_mutex);
m_condvar.notify_all();
auto erase_count = m_set.erase(name);
m_condvar.notify_all();
THROW_CHECK1(invalid_argument, erase_count, erase_count == 1);
}
template <class T>
void
LockSet<T>::wait_unlock(double interval)
{
unique_lock<mutex> lock(m_mutex);
if (m_set.empty()) return;
m_condvar.wait_for(lock, chrono::duration<double>(interval));
}
template <class T>
size_t
LockSet<T>::size()
@@ -133,7 +165,10 @@ namespace crucible {
LockSet<T>::copy()
{
unique_lock<mutex> lock(m_mutex);
return m_set;
// Make temporary copy of set while protected by mutex
auto rv = m_set;
// Return temporary copy after releasing lock
return rv;
}
template <class T>
@@ -183,26 +218,10 @@ namespace crucible {
}
template <class T>
LockSet<T>::Lock::Lock(Lock &&that) :
m_lockset(that.lockset),
m_name(that.m_name),
m_locked(that.m_locked)
typename LockSet<T>::LockHandle
LockSet<T>::make_lock(const key_type &name, bool start_locked)
{
that.m_locked = false;
}
template <class T>
typename LockSet<T>::Lock &
LockSet<T>::Lock::operator=(Lock &&that)
{
THROW_CHECK2(invalid_argument, &m_lockset, &that.m_lockset, &m_lockset == &that.m_lockset);
if (m_locked && that.m_name != m_name) {
unlock();
}
m_name = that.m_name;
m_locked = that.m_locked;
that.m_locked = false;
return *this;
return LockHandle(*this, name, start_locked);
}
}
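The LockSet changes above add a size cap: lock() now waits while the set is full or the name is already held, and unlock() wakes all waiters after erasing the name. A reduced sketch of that behaviour is shown below, assuming a hypothetical class DemoLockSet that stores only names (the real class also records the owning thread ID from gettid() and wraps locks in LockHandle).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <limits>
#include <mutex>
#include <set>

// Hypothetical, simplified stand-in for crucible::LockSet: a set of
// held names guarded by one mutex, with an optional cap on how many
// names may be held at the same time.
template <class T>
class DemoLockSet {
    std::set<T> m_set;
    std::mutex m_mutex;
    std::condition_variable m_condvar;
    std::size_t m_max_size = std::numeric_limits<std::size_t>::max();

    // Both helpers must be called with m_mutex held
    bool full() const { return m_set.size() >= m_max_size; }
    bool locked(const T &name) const { return m_set.count(name) != 0; }
public:
    void max_size(std::size_t n)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_max_size = n;
        // Raising the limit may unblock waiters
        m_condvar.notify_all();
    }
    void lock(const T &name)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Wait until there is room and nobody else holds this name
        m_condvar.wait(lock, [&] { return !full() && !locked(name); });
        m_set.insert(name);
    }
    bool try_lock(const T &name)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        if (full() || locked(name)) {
            return false;
        }
        m_set.insert(name);
        return true;
    }
    void unlock(const T &name)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_set.erase(name);
        // Wake every waiter: each one re-checks its own condition
        m_condvar.notify_all();
    }
    std::size_t size()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        return m_set.size();
    }
};
```

notify_all is used rather than notify_one because waiters block on different conditions (a specific name, or any free slot), so a single wakeup could go to a thread that still cannot proceed.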

@@ -0,0 +1,42 @@
#ifndef CRUCIBLE_MULTILOCK_H
#define CRUCIBLE_MULTILOCK_H
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>
#include <string>
namespace crucible {
using namespace std;
class MultiLocker {
mutex m_mutex;
condition_variable m_cv;
map<string, size_t> m_counters;
bool m_do_locking = true;
class LockHandle {
const string m_type;
MultiLocker &m_parent;
bool m_locked = false;
void set_locked(bool state);
public:
~LockHandle();
LockHandle(const string &type, MultiLocker &parent);
friend class MultiLocker;
};
friend class LockHandle;
bool is_lock_available(const string &type);
void put_lock(const string &type);
shared_ptr<LockHandle> get_lock_private(const string &type);
public:
static shared_ptr<LockHandle> get_lock(const string &type);
static void enable_locking(bool enabled);
};
}
#endif // CRUCIBLE_MULTILOCK_H

225
include/crucible/namedptr.h Normal file

@@ -0,0 +1,225 @@
#ifndef CRUCIBLE_NAMEDPTR_H
#define CRUCIBLE_NAMEDPTR_H
#include "crucible/lockset.h"
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <tuple>
namespace crucible {
using namespace std;
/// A thread-safe container for RAII of shared resources with unique names.
template <class Return, class... Arguments>
class NamedPtr {
public:
/// The name in "NamedPtr"
using Key = tuple<Arguments...>;
/// A shared pointer to the named object with ownership
/// tracking that erases the object's stored name when
/// the last shared pointer is destroyed.
using Ptr = shared_ptr<Return>;
/// A function that translates a name into a shared pointer to an object.
using Func = function<Ptr(Arguments...)>;
private:
struct Value;
using WeakPtr = weak_ptr<Value>;
using MapType = map<Key, WeakPtr>;
struct MapRep {
MapType m_map;
mutex m_mutex;
};
using MapPtr = shared_ptr<MapRep>;
/// Container for Return pointers. Destructor removes entry from map.
struct Value {
Ptr m_ret_ptr;
MapPtr m_map_rep;
Key m_ret_key;
~Value();
Value(Ptr&& ret_ptr, const Key &key, const MapPtr &map_rep);
};
Func m_fn;
MapPtr m_map_rep = make_shared<MapRep>();
LockSet<Key> m_lockset;
Ptr lookup_item(const Key &k);
Ptr insert_item(Func fn, Arguments... args);
public:
NamedPtr(Func f = Func());
void func(Func f);
Ptr operator()(Arguments... args);
Ptr insert(const Ptr &r, Arguments... args);
};
/// Construct NamedPtr map and define a function to turn a name into a pointer.
template <class Return, class... Arguments>
NamedPtr<Return, Arguments...>::NamedPtr(Func f) :
m_fn(f)
{
}
/// Construct a Value wrapper: the value to store, the argument key to store the value under,
/// and a pointer to the map. Everything needed to remove the key from the map when the
/// last NamedPtr is deleted. NamedPtr then releases its own pointer to the value, which
/// may or may not trigger deletion there.
template <class Return, class... Arguments>
NamedPtr<Return, Arguments...>::Value::Value(Ptr&& ret_ptr, const Key &key, const MapPtr &map_rep) :
m_ret_ptr(ret_ptr),
m_map_rep(map_rep),
m_ret_key(key)
{
}
/// Destroy a Value wrapper: remove a dead Key from the map, then let the member destructors
/// do the rest. The Key might be in the map and not dead, so leave it alone in that case.
template <class Return, class... Arguments>
NamedPtr<Return, Arguments...>::Value::~Value()
{
unique_lock<mutex> lock(m_map_rep->m_mutex);
// We are called from the shared_ptr destructor, so we
// know that the weak_ptr in the map has already expired;
// however, if another thread already noticed that the
// map entry expired while we were waiting for the lock,
// the other thread will have already replaced the map
// entry with a pointer to some other object, and that
// object now owns the map entry. So we do a key lookup
// here instead of storing a map iterator, and only erase
// "our" map entry if it exists and is expired. The other
// thread would have done the same for us if the race had
// a different winner.
const auto found = m_map_rep->m_map.find(m_ret_key);
if (found != m_map_rep->m_map.end() && found->second.expired()) {
m_map_rep->m_map.erase(found);
}
}
/// Find a Return by key and fetch a strong Return pointer.
/// Ignore Keys that have expired weak pointers.
template <class Return, class... Arguments>
typename NamedPtr<Return, Arguments...>::Ptr
NamedPtr<Return, Arguments...>::lookup_item(const Key &k)
{
// Must be called with lock held
const auto found = m_map_rep->m_map.find(k);
if (found != m_map_rep->m_map.end()) {
// Get the strong pointer back
const auto rv = found->second.lock();
if (rv) {
// Have strong pointer. Return value that shares map entry.
return shared_ptr<Return>(rv, rv->m_ret_ptr.get());
}
// Have expired weak pointer. Another thread is trying to delete it,
// but we got the lock first. Leave the map entry alone here.
// The other thread will erase it, or we will store a different
// pointer in the same map entry.
}
return Ptr();
}
/// Insert the Return value of calling Func(Arguments...).
/// If the value already exists in the map, return the existing value.
/// If another thread is already running Func(Arguments...) then this thread
/// will block until the other thread finishes inserting the Return in the
/// map, and both threads will return the same Return value.
template <class Return, class... Arguments>
typename NamedPtr<Return, Arguments...>::Ptr
NamedPtr<Return, Arguments...>::insert_item(Func fn, Arguments... args)
{
Key k(args...);
// Is it already in the map?
unique_lock<mutex> lock_lookup(m_map_rep->m_mutex);
auto rv = lookup_item(k);
if (rv) {
return rv;
}
// Release map lock and acquire key lock
lock_lookup.unlock();
const auto key_lock = m_lockset.make_lock(k);
// Did item appear in map while we were waiting for key?
lock_lookup.lock();
rv = lookup_item(k);
if (rv) {
return rv;
}
// We now hold the key and map locks, but the item is not in the map (or is expired).
// Release map lock so other threads can use the map
lock_lookup.unlock();
// Call the function and create a new Value outside of the map
const auto new_value_ptr = make_shared<Value>(fn(args...), k, m_map_rep);
// Function must return a non-null pointer
THROW_CHECK0(runtime_error, new_value_ptr->m_ret_ptr);
// Reacquire the map lock for map insertion. We still hold the key lock.
// Use a different lock object to make exceptions unlock in the right order
unique_lock<mutex> lock_insert(m_map_rep->m_mutex);
// Insert return value in map or overwrite existing
// empty or expired weak_ptr value.
WeakPtr &new_item_ref = m_map_rep->m_map[k];
// We searched the map while holding both locks and
// found no entry or an expired weak_ptr; therefore, no
// other thread could have inserted a new non-expired
// weak_ptr, and the weak_ptr in the map is expired
// or was default-constructed as a nullptr. So if the
// new_item_ref is not expired, we have a bug we need
// to find and fix.
assert(new_item_ref.expired());
// Update the map slot we are sure is empty
new_item_ref = new_value_ptr;
// Return shared_ptr to Return using strong pointer's reference counter
return shared_ptr<Return>(new_value_ptr, new_value_ptr->m_ret_ptr.get());
// Release map lock, then key lock
}
/// (Re)define a function to turn a name into a pointer.
template <class Return, class... Arguments>
void
NamedPtr<Return, Arguments...>::func(Func func)
{
unique_lock<mutex> lock(m_map_rep->m_mutex);
m_fn = func;
}
/// Convert a name into a pointer using the configured function.
template<class Return, class... Arguments>
typename NamedPtr<Return, Arguments...>::Ptr
NamedPtr<Return, Arguments...>::operator()(Arguments... args)
{
return insert_item(m_fn, args...);
}
/// Insert a pointer that has already been created under the
/// given name. Useful for inserting a pointer to a derived
/// class when the name doesn't contain all of the information
/// required for the object, or when the Return is already known by
/// some cheaper method than calling the function.
template<class Return, class... Arguments>
typename NamedPtr<Return, Arguments...>::Ptr
NamedPtr<Return, Arguments...>::insert(const Ptr &r, Arguments... args)
{
THROW_CHECK0(invalid_argument, r);
return insert_item([&](Arguments...) { return r; }, args...);
}
}
#endif // CRUCIBLE_NAMEDPTR_H
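NamedPtr above combines three pieces: a map of weak pointers, a Value wrapper whose destructor erases its own expired map entry, and shared_ptr's aliasing constructor so callers see a pointer to Return while sharing the wrapper's reference count. The following is a reduced sketch of that combination under stated assumptions: DemoNamedPtr is a hypothetical class, keys are plain strings rather than tuples, and the factory is called with the map lock held (the real class releases the map lock and serializes per key through LockSet).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for crucible::NamedPtr.
template <class Return>
class DemoNamedPtr {
    struct Value;
    struct MapRep {
        std::map<std::string, std::weak_ptr<Value>> m_map;
        std::mutex m_mutex;
    };
    // Wrapper that removes its own map entry when the last user pointer dies
    struct Value {
        std::shared_ptr<Return> m_ret_ptr;
        std::shared_ptr<MapRep> m_map_rep;
        std::string m_key;
        Value(std::shared_ptr<Return> ret_ptr, const std::string &key, std::shared_ptr<MapRep> map_rep) :
            m_ret_ptr(std::move(ret_ptr)), m_map_rep(std::move(map_rep)), m_key(key) {}
        ~Value()
        {
            std::unique_lock<std::mutex> lock(m_map_rep->m_mutex);
            // Look up by key and erase only if the entry is still expired;
            // a live entry belongs to a newer Value
            const auto found = m_map_rep->m_map.find(m_key);
            if (found != m_map_rep->m_map.end() && found->second.expired()) {
                m_map_rep->m_map.erase(found);
            }
        }
    };
    std::shared_ptr<MapRep> m_map_rep = std::make_shared<MapRep>();
public:
    using Ptr = std::shared_ptr<Return>;

    Ptr get(const std::string &key, const std::function<Ptr()> &make)
    {
        // Declared before the lock so that, on any exit path, the lock is
        // released before a Value destructor could try to take it again
        std::shared_ptr<Value> value;
        std::unique_lock<std::mutex> lock(m_map_rep->m_mutex);
        const auto found = m_map_rep->m_map.find(key);
        if (found != m_map_rep->m_map.end()) {
            value = found->second.lock();
        }
        if (!value) {
            // Simplification: the factory runs with the map lock held
            value = std::make_shared<Value>(make(), key, m_map_rep);
            m_map_rep->m_map[key] = value;
        }
        // Aliasing constructor: points at Return, shares Value's refcount
        return Ptr(value, value->m_ret_ptr.get());
    }
    std::size_t size()
    {
        std::unique_lock<std::mutex> lock(m_map_rep->m_mutex);
        return m_map_rep->m_map.size();
    }
};
```

The aliasing constructor is what lets the map entry disappear exactly when the last user-visible pointer is destroyed, without the caller ever seeing the Value wrapper type.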

@@ -7,12 +7,12 @@ namespace crucible {
using namespace std;
struct bits_ntoa_table {
unsigned long n;
unsigned long mask;
unsigned long long n;
unsigned long long mask;
const char *a;
};
string bits_ntoa(unsigned long n, const bits_ntoa_table *a);
string bits_ntoa(unsigned long long n, const bits_ntoa_table *a);
};
@@ -20,9 +20,9 @@ namespace crucible {
#define NTOA_TABLE_ENTRY_BITS(x) { .n = (x), .mask = (x), .a = (#x) }
// Enumerations (entire value matches all bits)
#define NTOA_TABLE_ENTRY_ENUM(x) { .n = (x), .mask = ~0UL, .a = (#x) }
#define NTOA_TABLE_ENTRY_ENUM(x) { .n = (x), .mask = ~0ULL, .a = (#x) }
// End of table (sorry, gcc doesn't implement this)
// End of table (sorry, C++ didn't get C99's compound literals, so we have to write out all the member names)
#define NTOA_TABLE_ENTRY_END() { .n = 0, .mask = 0, .a = nullptr }
#endif // CRUCIBLE_NTOA_H

@@ -0,0 +1,52 @@
#ifndef CRUCIBLE_OPENAT2_H
#define CRUCIBLE_OPENAT2_H
#include <cstdlib>
// Compatibility for building on old libc for new kernel
#include <linux/version.h>
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
#include <linux/openat2.h>
#else
#include <linux/types.h>
#ifndef RESOLVE_NO_XDEV
#define RESOLVE_NO_XDEV 1
// RESOLVE_NO_XDEV was there from the beginning of openat2,
// so if that's missing, so is open_how
struct open_how {
__u64 flags;
__u64 mode;
__u64 resolve;
};
#endif
#ifndef RESOLVE_NO_MAGICLINKS
#define RESOLVE_NO_MAGICLINKS 2
#endif
#ifndef RESOLVE_NO_SYMLINKS
#define RESOLVE_NO_SYMLINKS 4
#endif
#ifndef RESOLVE_BENEATH
#define RESOLVE_BENEATH 8
#endif
#ifndef RESOLVE_IN_ROOT
#define RESOLVE_IN_ROOT 16
#endif
#endif // Linux version >= v5.6
extern "C" {
/// Weak symbol to support libc with no syscall wrapper
int openat2(int dirfd, const char *pathname, struct open_how *how, size_t size) throw();
};
#endif // CRUCIBLE_OPENAT2_H

185
include/crucible/pool.h Normal file

@@ -0,0 +1,185 @@
#ifndef CRUCIBLE_POOL_H
#define CRUCIBLE_POOL_H
#include "crucible/error.h"
#include <functional>
#include <list>
#include <memory>
#include <mutex>
namespace crucible {
using namespace std;
/// Storage for reusable anonymous objects that are too expensive to create and/or destroy frequently
template <class T>
class Pool {
public:
using Ptr = shared_ptr<T>;
using Generator = function<Ptr()>;
using Checker = function<void(Ptr)>;
~Pool();
Pool(Generator f = Generator(), Checker checkin = Checker(), Checker checkout = Checker());
/// Function to create new objects when Pool is empty
void generator(Generator f);
/// Optional function called when objects exit the pool (user handle is created and returned to user)
void checkout(Checker f);
/// Optional function called when objects enter the pool (last user handle is destroyed)
void checkin(Checker f);
/// Pool() returns a handle to an object of type shared_ptr<T>
Ptr operator()();
/// Destroy all objects in Pool that are not in use
void clear();
private:
struct PoolRep {
list<Ptr> m_list;
mutex m_mutex;
Checker m_checkin;
PoolRep(Checker checkin);
};
struct Handle {
weak_ptr<PoolRep> m_list_rep;
Ptr m_ret_ptr;
Handle(shared_ptr<PoolRep> list_rep, Ptr ret_ptr);
~Handle();
};
Generator m_fn;
Checker m_checkout;
shared_ptr<PoolRep> m_list_rep;
};
template <class T>
Pool<T>::PoolRep::PoolRep(Checker checkin) :
m_checkin(checkin)
{
}
template <class T>
Pool<T>::Pool(Generator f, Checker checkin, Checker checkout) :
m_fn(f),
m_checkout(checkout),
m_list_rep(make_shared<PoolRep>(checkin))
{
}
template <class T>
Pool<T>::~Pool()
{
auto list_rep = m_list_rep;
unique_lock<mutex> lock(list_rep->m_mutex);
m_list_rep.reset();
}
template <class T>
Pool<T>::Handle::Handle(shared_ptr<PoolRep> list_rep, Ptr ret_ptr) :
m_list_rep(list_rep),
m_ret_ptr(ret_ptr)
{
}
template <class T>
Pool<T>::Handle::~Handle()
{
// Checkin prepares the object for storage and reuse.
// Neither of those will happen if there is no Pool.
// If the Pool was destroyed, just let m_ret_ptr expire.
auto list_rep = m_list_rep.lock();
if (!list_rep) {
return;
}
unique_lock<mutex> lock(list_rep->m_mutex);
// If a checkin function is defined, call it
auto checkin = list_rep->m_checkin;
if (checkin) {
lock.unlock();
checkin(m_ret_ptr);
lock.lock();
}
// Place object back in pool
list_rep->m_list.push_front(m_ret_ptr);
}
template <class T>
typename Pool<T>::Ptr
Pool<T>::operator()()
{
Ptr rv;
// Do we have an object in the pool we can return instead?
unique_lock<mutex> lock(m_list_rep->m_mutex);
if (m_list_rep->m_list.empty()) {
// No, release cache lock and call the function
lock.unlock();
// Create new value
rv = m_fn();
} else {
rv = m_list_rep->m_list.front();
m_list_rep->m_list.pop_front();
// Release lock so we don't deadlock with Handle destructor
lock.unlock();
}
// rv now points to a T object that is not in the list.
THROW_CHECK0(runtime_error, rv);
// Construct a shared_ptr for Handle which will refcount the Handle objects
// and reinsert the T into the Pool when the last Handle is destroyed.
auto hv = make_shared<Handle>(m_list_rep, rv);
// If a checkout function is defined, call it
if (m_checkout) {
m_checkout(rv);
}
// Return an alias shared_ptr for the T using Handle's refcount.
return Ptr(hv, rv.get());
}
template <class T>
void
Pool<T>::generator(Generator func)
{
unique_lock<mutex> lock(m_list_rep->m_mutex);
m_fn = func;
}
template <class T>
void
Pool<T>::checkin(Checker func)
{
unique_lock<mutex> lock(m_list_rep->m_mutex);
m_list_rep->m_checkin = func;
}
template <class T>
void
Pool<T>::checkout(Checker func)
{
unique_lock<mutex> lock(m_list_rep->m_mutex);
m_checkout = func;
}
template <class T>
void
Pool<T>::clear()
{
unique_lock<mutex> lock(m_list_rep->m_mutex);
m_list_rep->m_list.clear();
}
}
#endif // CRUCIBLE_POOL_H
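Pool above hands out shared_ptr<T> values that are really aliases of a hidden Handle; when the last alias is destroyed, the Handle destructor puts the T back on the free list instead of deleting it. A reduced sketch of that recycling path follows, assuming a hypothetical class DemoPool with no checkin or checkout callbacks and an added idle() accessor for inspection.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical, simplified stand-in for crucible::Pool.
template <class T>
class DemoPool {
public:
    using Ptr = std::shared_ptr<T>;
    using Generator = std::function<Ptr()>;
private:
    struct Rep {
        std::list<Ptr> m_list;
        std::mutex m_mutex;
    };
    // Owns one checked-out object; returns it to the pool on destruction
    struct Handle {
        std::weak_ptr<Rep> m_rep;
        Ptr m_obj;
        Handle(const std::shared_ptr<Rep> &rep, Ptr obj) : m_rep(rep), m_obj(std::move(obj)) {}
        ~Handle()
        {
            // If the pool is already gone, just let m_obj expire
            const auto rep = m_rep.lock();
            if (!rep) {
                return;
            }
            std::unique_lock<std::mutex> lock(rep->m_mutex);
            rep->m_list.push_front(m_obj);
        }
    };
    Generator m_fn;
    std::shared_ptr<Rep> m_rep = std::make_shared<Rep>();
public:
    explicit DemoPool(Generator fn) : m_fn(std::move(fn)) {}

    Ptr operator()()
    {
        Ptr rv;
        {
            // Reuse an idle object if one is available
            std::unique_lock<std::mutex> lock(m_rep->m_mutex);
            if (!m_rep->m_list.empty()) {
                rv = m_rep->m_list.front();
                m_rep->m_list.pop_front();
            }
        }
        if (!rv) {
            // Pool was empty: create a new object outside the lock
            rv = m_fn();
        }
        const auto handle = std::make_shared<Handle>(m_rep, rv);
        // Alias: points at the T, but shares the Handle's refcount
        return Ptr(handle, rv.get());
    }
    std::size_t idle()
    {
        std::unique_lock<std::mutex> lock(m_rep->m_mutex);
        return m_rep->m_list.size();
    }
};
```

Holding the pool state through a weak_ptr in Handle is what allows outstanding objects to outlive the pool safely: they are simply destroyed instead of being recycled.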

@@ -10,6 +10,10 @@
#include <sys/wait.h>
#include <unistd.h>
extern "C" {
pid_t gettid() throw();
};
namespace crucible {
using namespace std;
@@ -73,6 +77,10 @@ namespace crucible {
typedef ResourceHandle<Process::id, Process> Pid;
pid_t gettid();
double getloadavg1();
double getloadavg5();
double getloadavg15();
string signal_ntoa(int sig);
}
#endif // CRUCIBLE_PROCESS_H

140
include/crucible/progress.h Normal file

@@ -0,0 +1,140 @@
#ifndef CRUCIBLE_PROGRESS_H
#define CRUCIBLE_PROGRESS_H
#include "crucible/error.h"
#include <functional>
#include <memory>
#include <mutex>
#include <set>
#include <cassert>
namespace crucible {
using namespace std;
/// A class to track progress of multiple workers using only two points:
/// the first and last incomplete state. The first incomplete
/// state can be recorded as a checkpoint to resume later on.
/// The last completed state is the starting point for workers that
/// need something to do.
template <class T>
class ProgressTracker {
struct ProgressTrackerState;
class ProgressHolderState;
public:
using value_type = T;
using ProgressHolder = shared_ptr<ProgressHolderState>;
/// Create ProgressTracker with initial begin and end state 'v'.
ProgressTracker(const value_type &v);
/// The first incomplete state. This is not "sticky",
/// it will revert to the end state if there are no
/// items in progress.
value_type begin() const;
/// The last incomplete state. This is "sticky",
/// it can only increase and never decrease.
value_type end() const;
ProgressHolder hold(const value_type &v);
friend class ProgressHolderState;
private:
struct ProgressTrackerState {
using key_type = pair<value_type, ProgressHolderState *>;
mutex m_mutex;
set<key_type> m_in_progress;
value_type m_begin;
value_type m_end;
};
class ProgressHolderState {
shared_ptr<ProgressTrackerState> m_state;
const value_type m_value;
using key_type = typename ProgressTrackerState::key_type;
public:
ProgressHolderState(shared_ptr<ProgressTrackerState> state, const value_type &v);
~ProgressHolderState();
value_type get() const;
};
shared_ptr<ProgressTrackerState> m_state;
};
template <class T>
typename ProgressTracker<T>::value_type
ProgressTracker<T>::begin() const
{
unique_lock<mutex> lock(m_state->m_mutex);
return m_state->m_begin;
}
template <class T>
typename ProgressTracker<T>::value_type
ProgressTracker<T>::end() const
{
unique_lock<mutex> lock(m_state->m_mutex);
return m_state->m_end;
}
template <class T>
typename ProgressTracker<T>::value_type
ProgressTracker<T>::ProgressHolderState::get() const
{
return m_value;
}
template <class T>
ProgressTracker<T>::ProgressTracker(const ProgressTracker::value_type &t) :
m_state(make_shared<ProgressTrackerState>())
{
m_state->m_begin = t;
m_state->m_end = t;
}
template <class T>
ProgressTracker<T>::ProgressHolderState::ProgressHolderState(shared_ptr<ProgressTrackerState> state, const value_type &v) :
m_state(state),
m_value(v)
{
unique_lock<mutex> lock(m_state->m_mutex);
const auto rv = m_state->m_in_progress.insert(key_type(m_value, this));
THROW_CHECK1(runtime_error, m_value, rv.second);
// Set the beginning to the first existing in-progress item
m_state->m_begin = m_state->m_in_progress.begin()->first;
// If this value is past the end, move the end, but don't go backwards
if (m_state->m_end < m_value) {
m_state->m_end = m_value;
}
}
template <class T>
ProgressTracker<T>::ProgressHolderState::~ProgressHolderState()
{
unique_lock<mutex> lock(m_state->m_mutex);
const auto rv = m_state->m_in_progress.erase(key_type(m_value, this));
// THROW_CHECK2(runtime_error, m_value, rv, rv == 1);
assert(rv == 1);
if (m_state->m_in_progress.empty()) {
// If we made the list empty, then m_begin == m_end
m_state->m_begin = m_state->m_end;
} else {
// If we deleted the first element, then m_begin = current first element
m_state->m_begin = m_state->m_in_progress.begin()->first;
}
}
template <class T>
shared_ptr<typename ProgressTracker<T>::ProgressHolderState>
ProgressTracker<T>::hold(const value_type &v)
{
return make_shared<ProgressHolderState>(m_state, v);
}
}
#endif // CRUCIBLE_PROGRESS_H
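ProgressTracker above keeps two values: begin() follows the lowest item still in progress and falls back to end() when nothing is held, while end() only ever moves forward. A reduced sketch of those rules follows, assuming a hypothetical class DemoProgress that stores held values in a multiset (the real class uses a set of value/holder-pointer pairs) and requires T to be default-constructible.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <set>

// Hypothetical, simplified stand-in for crucible::ProgressTracker.
template <class T>
class DemoProgress {
    struct State {
        std::mutex m_mutex;
        std::multiset<T> m_in_progress;
        T m_begin;
        T m_end;
    };
    std::shared_ptr<State> m_state;
public:
    // RAII marker for one value that is currently being worked on
    class Holder {
        std::shared_ptr<State> m_state;
        const T m_value;
    public:
        Holder(const std::shared_ptr<State> &state, const T &v) : m_state(state), m_value(v)
        {
            std::unique_lock<std::mutex> lock(m_state->m_mutex);
            m_state->m_in_progress.insert(m_value);
            // begin tracks the lowest value still in progress
            m_state->m_begin = *m_state->m_in_progress.begin();
            // end moves forward only
            if (m_state->m_end < m_value) {
                m_state->m_end = m_value;
            }
        }
        ~Holder()
        {
            std::unique_lock<std::mutex> lock(m_state->m_mutex);
            m_state->m_in_progress.erase(m_state->m_in_progress.find(m_value));
            if (m_state->m_in_progress.empty()) {
                // Nothing in progress: begin catches up with end
                m_state->m_begin = m_state->m_end;
            } else {
                m_state->m_begin = *m_state->m_in_progress.begin();
            }
        }
        Holder(const Holder &) = delete;
        Holder &operator=(const Holder &) = delete;
        T get() const { return m_value; }
    };

    explicit DemoProgress(const T &v) : m_state(std::make_shared<State>())
    {
        m_state->m_begin = v;
        m_state->m_end = v;
    }
    T begin() const
    {
        std::unique_lock<std::mutex> lock(m_state->m_mutex);
        return m_state->m_begin;
    }
    T end() const
    {
        std::unique_lock<std::mutex> lock(m_state->m_mutex);
        return m_state->m_end;
    }
    std::shared_ptr<Holder> hold(const T &v)
    {
        return std::make_shared<Holder>(m_state, v);
    }
};
```

In this sketch begin() is the value that would be safe to record as a resume checkpoint, since everything below it has been released by its holder.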

@@ -8,6 +8,7 @@
#include <memory>
#include <mutex>
#include <iostream>
#include <stdexcept>
namespace crucible {
using namespace std;
@@ -44,36 +45,29 @@ namespace crucible {
private:
using traits_type = ResourceTraits<Key, Resource>;
class ResourceHolder {
resource_ptr_type m_ptr;
public:
~ResourceHolder();
ResourceHolder(resource_ptr_type that);
ResourceHolder(const ResourceHolder &that) = default;
ResourceHolder(ResourceHolder &&that) = default;
ResourceHolder& operator=(ResourceHolder &&that) = default;
ResourceHolder& operator=(const ResourceHolder &that) = default;
resource_ptr_type get_resource_ptr() const;
};
using holder_ptr_type = shared_ptr<ResourceHolder>;
using weak_holder_ptr_type = weak_ptr<ResourceHolder>;
using map_type = map<key_type, weak_holder_ptr_type>;
using weak_ptr_type = weak_ptr<Resource>;
using map_type = map<key_type, weak_ptr_type>;
// The only instance variable
holder_ptr_type m_ptr;
resource_ptr_type m_ptr;
// A bunch of static variables and functions
static mutex &s_mutex();
static shared_ptr<map_type> s_map();
static holder_ptr_type insert(const key_type &key);
static holder_ptr_type insert(const resource_ptr_type &res);
static void erase(const key_type &key);
static mutex s_map_mutex;
static map_type s_map;
static resource_ptr_type insert(const key_type &key);
static resource_ptr_type insert(const resource_ptr_type &res);
static void clean_locked();
static ResourceTraits<Key, Resource> s_traits;
public:
// Exceptions
struct duplicate_resource : public invalid_argument {
key_type m_key;
key_type get_key() const;
duplicate_resource(const key_type &key);
};
// test for resource. A separate operator because key_type could be confused with bool.
bool operator!() const;
@@ -89,8 +83,15 @@ namespace crucible {
ResourceHandle(const resource_ptr_type &res);
ResourceHandle& operator=(const resource_ptr_type &res);
// default constructor is public
// default construct/assign/move is public and mostly harmless
ResourceHandle() = default;
ResourceHandle(const ResourceHandle &that) = default;
ResourceHandle(ResourceHandle &&that) = default;
ResourceHandle& operator=(const ResourceHandle &that) = default;
ResourceHandle& operator=(ResourceHandle &&that) = default;
// Nontrivial destructor
~ResourceHandle();
// forward anything else to the Resource constructor
// if we can do so unambiguously
@@ -109,7 +110,7 @@ namespace crucible {
// get pointer to Resource object (nothrow, result may be null)
resource_ptr_type get_resource_ptr() const;
// this version throws and is probably not thread safe
// this version throws
resource_ptr_type operator->() const;
// dynamic casting of the resource (throws if cast fails)
@@ -145,139 +146,94 @@ namespace crucible {
}
template <class Key, class Resource>
ResourceHandle<Key, Resource>::ResourceHolder::ResourceHolder(resource_ptr_type that) :
m_ptr(that)
ResourceHandle<Key, Resource>::duplicate_resource::duplicate_resource(const key_type &key) :
invalid_argument("duplicate resource"),
m_key(key)
{
// Cannot insert ourselves here since our shared_ptr does not exist yet.
}
template <class Key, class Resource>
mutex &
ResourceHandle<Key, Resource>::s_mutex()
auto
ResourceHandle<Key, Resource>::duplicate_resource::get_key() const -> key_type
{
static mutex gcc_won_t_instantiate_this_either;
return gcc_won_t_instantiate_this_either;
}
template <class Key, class Resource>
shared_ptr<typename ResourceHandle<Key, Resource>::map_type>
ResourceHandle<Key, Resource>::s_map()
{
static shared_ptr<map_type> gcc_won_t_instantiate_the_damn_static_vars;
if (!gcc_won_t_instantiate_the_damn_static_vars) {
gcc_won_t_instantiate_the_damn_static_vars = make_shared<map_type>();
}
return gcc_won_t_instantiate_the_damn_static_vars;
return m_key;
}
template <class Key, class Resource>
void
ResourceHandle<Key, Resource>::erase(const key_type &key)
ResourceHandle<Key, Resource>::clean_locked()
{
unique_lock<mutex> lock(s_mutex());
// Resources are allowed to set their Keys to null.
if (s_traits.is_null_key(key)) {
// Clean out any dead weak_ptr objects.
for (auto i = s_map()->begin(); i != s_map()->end(); ) {
if (! (*i).second.lock()) {
i = s_map()->erase(i);
} else {
++i;
}
// Must be called with lock held
for (auto i = s_map.begin(); i != s_map.end(); ) {
auto this_i = i;
++i;
if (this_i->second.expired()) {
s_map.erase(this_i);
}
return;
}
auto erased = s_map()->erase(key);
if (erased != 1) {
cerr << __PRETTY_FUNCTION__ << ": WARNING: s_map()->erase(" << key << ") returned " << erased << " != 1" << endl;
}
}
template <class Key, class Resource>
ResourceHandle<Key, Resource>::ResourceHolder::~ResourceHolder()
{
if (!m_ptr) {
// Probably something harmless like a failed constructor.
cerr << __PRETTY_FUNCTION__ << ": WARNING: destroying null m_ptr" << endl;
return;
}
Key key = s_traits.get_key(*m_ptr);
ResourceHandle::erase(key);
}
template <class Key, class Resource>
typename ResourceHandle<Key, Resource>::holder_ptr_type
typename ResourceHandle<Key, Resource>::resource_ptr_type
ResourceHandle<Key, Resource>::insert(const key_type &key)
{
// no Resources for null keys
if (s_traits.is_null_key(key)) {
return holder_ptr_type();
return resource_ptr_type();
}
unique_lock<mutex> lock(s_mutex());
// find ResourceHolder for non-null key
auto found = s_map()->find(key);
if (found != s_map()->end()) {
holder_ptr_type rv = (*found).second.lock();
// a weak_ptr may have expired
unique_lock<mutex> lock(s_map_mutex);
auto found = s_map.find(key);
if (found != s_map.end()) {
resource_ptr_type rv = found->second.lock();
if (rv) {
// Use existing Resource
return rv;
} else {
// It's OK for the map to temporarily contain an expired weak_ptr to some dead Resource
clean_locked();
}
}
// not found or expired, throw any existing ref away and make a new one
resource_ptr_type rpt = s_traits.make_resource(key);
holder_ptr_type hpt = make_shared<ResourceHolder>(rpt);
// store weak_ptr in map
(*s_map())[key] = hpt;
s_map[key] = rpt;
// return shared_ptr
return hpt;
return rpt;
};
template <class Key, class Resource>
typename ResourceHandle<Key, Resource>::holder_ptr_type
typename ResourceHandle<Key, Resource>::resource_ptr_type
ResourceHandle<Key, Resource>::insert(const resource_ptr_type &res)
{
// no Resource, no ResourceHolder.
// no Resources for null keys
if (!res) {
return holder_ptr_type();
return resource_ptr_type();
}
// no ResourceHolders for null keys either.
key_type key = s_traits.get_key(*res);
if (s_traits.is_null_key(key)) {
return holder_ptr_type();
return resource_ptr_type();
}
unique_lock<mutex> lock(s_mutex());
// find ResourceHolder for non-null key
auto found = s_map()->find(key);
if (found != s_map()->end()) {
holder_ptr_type rv = (*found).second.lock();
// The map doesn't own the ResourceHolders, the ResourceHandles do.
// It's OK for the map to contain an expired weak_ptr to some dead ResourceHolder...
unique_lock<mutex> lock(s_map_mutex);
// find Resource for non-null key
auto found = s_map.find(key);
if (found != s_map.end()) {
resource_ptr_type rv = found->second.lock();
// It's OK for the map to temporarily contain an expired weak_ptr to some dead Resource...
if (rv) {
// found ResourceHolder, look at pointer
resource_ptr_type rp = rv->get_resource_ptr();
// We do not store references to null Resources.
assert(rp);
// Key retrieved for an existing object must match key searched or be null.
key_type found_key = s_traits.get_key(*rp);
bool found_key_is_null = s_traits.is_null_key(found_key);
assert(found_key_is_null || found_key == key);
if (!found_key_is_null) {
// We do not store references to duplicate resources.
if (rp.owner_before(res) || res.owner_before(rp)) {
cerr << "inserting new Resource with existing Key " << key << " not allowed at " << __PRETTY_FUNCTION__ << endl;;
abort();
// THROW_ERROR(out_of_range, "inserting new Resource with existing Key " << key << " not allowed at " << __PRETTY_FUNCTION__);
}
// rv is good, return it
return rv;
// ...but not a duplicate Resource.
if (rv.owner_before(res) || res.owner_before(rv)) {
throw duplicate_resource(key);
}
// Use the existing Resource (discard the caller's).
return rv;
} else {
// Clean out expired weak_ptrs
clean_locked();
}
}
// not found or expired, make a new one
holder_ptr_type rv = make_shared<ResourceHolder>(res);
s_map()->insert(make_pair(key, weak_holder_ptr_type(rv)));
// no need to check s_map result, we are either replacing a dead weak_ptr or adding a new one
return rv;
// not found or expired, make a new one or replace old one
s_map[key] = res;
return res;
};
template <class Key, class Resource>
@@ -309,31 +265,47 @@ namespace crucible {
}
template <class Key, class Resource>
typename ResourceHandle<Key, Resource>::resource_ptr_type
ResourceHandle<Key, Resource>::ResourceHolder::get_resource_ptr() const
ResourceHandle<Key, Resource>::~ResourceHandle()
{
return m_ptr;
// No pointer, nothing to do
if (!m_ptr) {
return;
}
// Save key so we can clean the map
auto key = s_traits.get_key(*m_ptr);
// Save a weak_ptr so we can tell if we need to clean the map
weak_ptr_type wp = m_ptr;
// Drop shared_ptr
m_ptr.reset();
// If there are still other references to the shared_ptr, we can stop now
if (!wp.expired()) {
return;
}
// Remove weak_ptr from map if it has expired
// (and not been replaced in the meantime)
unique_lock<mutex> lock_map(s_map_mutex);
auto found = s_map.find(key);
// Map entry may have been replaced, so check for expiry again
if (found != s_map.end() && found->second.expired()) {
s_map.erase(key);
}
}
template <class Key, class Resource>
typename ResourceHandle<Key, Resource>::resource_ptr_type
ResourceHandle<Key, Resource>::get_resource_ptr() const
{
if (!m_ptr) {
return resource_ptr_type();
}
return m_ptr->get_resource_ptr();
return m_ptr;
}
template <class Key, class Resource>
typename ResourceHandle<Key, Resource>::resource_ptr_type
ResourceHandle<Key, Resource>::operator->() const
{
resource_ptr_type rp = get_resource_ptr();
if (!rp) {
if (!m_ptr) {
THROW_ERROR(out_of_range, __PRETTY_FUNCTION__ << " called on null Resource");
}
return rp;
return m_ptr;
}
template <class Key, class Resource>
@@ -342,11 +314,10 @@ namespace crucible {
ResourceHandle<Key, Resource>::cast() const
{
shared_ptr<T> dp;
resource_ptr_type rp = get_resource_ptr();
if (!rp) {
if (!m_ptr) {
return dp;
}
dp = dynamic_pointer_cast<T>(rp);
dp = dynamic_pointer_cast<T>(m_ptr);
if (!dp) {
throw bad_cast();
}
@@ -357,11 +328,10 @@ namespace crucible {
typename ResourceHandle<Key, Resource>::key_type
ResourceHandle<Key, Resource>::get_key() const
{
resource_ptr_type rp = get_resource_ptr();
if (!rp) {
if (!m_ptr) {
return s_traits.get_null_key();
} else {
return s_traits.get_key(*rp);
return s_traits.get_key(*m_ptr);
}
}
@@ -378,9 +348,15 @@ namespace crucible {
return s_traits.is_null_key(operator key_type());
}
// Apparently GCC wants these to be used before they are defined.
template <class Key, class Resource>
ResourceTraits<Key, Resource> ResourceHandle<Key, Resource>::s_traits;
template <class Key, class Resource>
mutex ResourceHandle<Key, Resource>::s_map_mutex;
template <class Key, class Resource>
typename ResourceHandle<Key, Resource>::map_type ResourceHandle<Key, Resource>::s_map;
}
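
The rewritten ResourceHandle keeps only weak references in its static map, so the map never keeps a Resource alive on its own. A minimal standalone sketch of that pattern follows. `Registry`, `NamedThing`, and `forget_if_expired` are hypothetical names chosen for illustration and are not part of crucible; the real class also handles null keys and duplicate resources, which this sketch omits.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical resource type, for illustration only.
struct NamedThing {
    std::string name;
    explicit NamedThing(const std::string &n) : name(n) {}
};

// Sketch of a weak_ptr-keyed registry: the map never owns a resource,
// so a resource dies when the last caller-held shared_ptr goes away.
class Registry {
    std::mutex m_mutex;
    std::map<std::string, std::weak_ptr<NamedThing>> m_map;
public:
    // Return the live resource for key, or create one and remember it weakly.
    std::shared_ptr<NamedThing> insert(const std::string &key) {
        std::unique_lock<std::mutex> lock(m_mutex);
        auto found = m_map.find(key);
        if (found != m_map.end()) {
            if (auto live = found->second.lock()) {
                return live;
            }
        }
        // Not found, or the old entry expired: replace it.
        auto fresh = std::make_shared<NamedThing>(key);
        m_map[key] = fresh;
        return fresh;
    }

    // Drop the entry for key only if its resource has already expired.
    void forget_if_expired(const std::string &key) {
        std::unique_lock<std::mutex> lock(m_mutex);
        auto found = m_map.find(key);
        if (found != m_map.end() && found->second.expired()) {
            m_map.erase(found);
        }
    }

    // Number of map entries, live or expired.
    std::size_t size() {
        std::unique_lock<std::mutex> lock(m_mutex);
        return m_map.size();
    }
};
```

Two `insert` calls with the same key return the same object while either pointer is alive. Once both are dropped, the entry is expired and can be erased, which is roughly what the new destructor does after resetting its own pointer.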

158
include/crucible/seeker.h Normal file

@@ -0,0 +1,158 @@
#ifndef _CRUCIBLE_SEEKER_H_
#define _CRUCIBLE_SEEKER_H_
#include "crucible/error.h"
#include <algorithm>
#include <limits>
// Debug stream
#include <memory>
#include <iostream>
#include <sstream>
#include <cstdint>
namespace crucible {
using namespace std;
extern thread_local shared_ptr<ostream> tl_seeker_debug_str;
#define SEEKER_DEBUG_LOG(__x) do { \
if (tl_seeker_debug_str) { \
(*tl_seeker_debug_str) << __x << "\n"; \
} \
} while (false)
// Requirements for Container<Pos> Fetch(Pos lower, Pos upper):
// - fetches objects in Pos order, starting from lower (must be >= lower)
// - must return upper if present, may or may not return objects after that
// - returns a container of Pos objects with begin(), end(), rbegin(), rend()
// - container must iterate over objects in Pos order
// - uniqueness of Pos objects not required
// - should store the underlying data as a side effect
//
// Requirements for Pos:
// - should behave like an unsigned integer type
// - must have specializations in numeric_limits<T> for digits, max(), min()
// - must support +, -, -=, and related operators
// - must support <, <=, ==, and related operators
// - must support Pos / 2 (only)
//
// Requirements for seek_backward:
// - calls Fetch to search Pos space near target_pos
// - if no key exists with value <= target_pos, returns the minimum Pos value
// - returns the highest key value <= target_pos
// - returned key value may not be part of most recent Fetch result
// - 1 loop iteration when target_pos exists
template <class Fetch, class Pos = uint64_t>
Pos
seek_backward(Pos const target_pos, Fetch fetch, Pos min_step = 1, size_t max_loops = numeric_limits<size_t>::max())
{
static const Pos end_pos = numeric_limits<Pos>::max();
// TBH this probably won't work if begin_pos != 0, i.e. any signed type
static const Pos begin_pos = numeric_limits<Pos>::min();
// Run a binary search looking for the highest key below target_pos.
// Initial upper bound of the search is target_pos.
// Find initial lower bound by doubling the size of the range until a key below target_pos
// is found, or the lower bound reaches the beginning of the search space.
// If the lower bound search reaches the beginning of the search space without finding a key,
// return the beginning of the search space; otherwise, perform a binary search between
// the bounds now established.
Pos lower_bound = 0;
Pos upper_bound = target_pos;
bool found_low = false;
Pos probe_pos = target_pos;
// We need one loop for each bit of the search space to find the lower bound,
// one loop for each bit of the search space to find the upper bound,
// and one extra loop to confirm the boundary is correct.
for (size_t loop_count = min((1 + numeric_limits<Pos>::digits) * size_t(2), max_loops); loop_count; --loop_count) {
SEEKER_DEBUG_LOG("fetch(probe_pos = " << probe_pos << ", target_pos = " << target_pos << ")");
auto result = fetch(probe_pos, target_pos);
const Pos low_pos = result.empty() ? end_pos : *result.begin();
const Pos high_pos = result.empty() ? end_pos : *result.rbegin();
SEEKER_DEBUG_LOG(" = " << low_pos << ".." << high_pos);
// check for correct behavior of the fetch function
THROW_CHECK2(out_of_range, high_pos, probe_pos, probe_pos <= high_pos);
THROW_CHECK2(out_of_range, low_pos, probe_pos, probe_pos <= low_pos);
THROW_CHECK2(out_of_range, low_pos, high_pos, low_pos <= high_pos);
if (!found_low) {
// if target_pos == end_pos then we will find it in every empty result set,
// so in that case we force the lower bound to be lower than end_pos
if ((target_pos == end_pos) ? (low_pos < target_pos) : (low_pos <= target_pos)) {
// found a lower bound, set the low bound there and switch to binary search
found_low = true;
lower_bound = low_pos;
SEEKER_DEBUG_LOG("found_low = true, lower_bound = " << lower_bound);
} else {
// still looking for lower bound
// if probe_pos was begin_pos then we can stop with no result
if (probe_pos == begin_pos) {
SEEKER_DEBUG_LOG("return: probe_pos == begin_pos " << begin_pos);
return begin_pos;
}
// double the range size, or use the distance between objects found so far
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
// already checked low_pos <= high_pos above
const Pos want_delta = max(upper_bound - probe_pos, min_step);
// avoid underflowing the beginning of the search space
const Pos have_delta = min(want_delta, probe_pos - begin_pos);
THROW_CHECK2(out_of_range, want_delta, have_delta, have_delta <= want_delta);
// move probe and try again
probe_pos = probe_pos - have_delta;
SEEKER_DEBUG_LOG("probe_pos " << probe_pos << " = probe_pos - have_delta " << have_delta << " (want_delta " << want_delta << ")");
continue;
}
}
if (low_pos <= target_pos && target_pos <= high_pos) {
// have keys on either side of target_pos in result
// search from the high end until we find the highest key below target
for (auto i = result.rbegin(); i != result.rend(); ++i) {
// more correctness checking for fetch
THROW_CHECK2(out_of_range, *i, probe_pos, probe_pos <= *i);
if (*i <= target_pos) {
SEEKER_DEBUG_LOG("return: *i " << *i << " <= target_pos " << target_pos);
return *i;
}
}
// if the list is empty then low_pos = high_pos = end_pos
// if target_pos = end_pos also, then we will execute the loop
// above but not find any matching entries.
THROW_CHECK0(runtime_error, result.empty());
}
if (target_pos <= low_pos) {
// results are all too high, so probe_pos..low_pos is too high
// lower the high bound to the probe pos, low_pos cannot be lower
SEEKER_DEBUG_LOG("upper_bound = probe_pos " << probe_pos);
upper_bound = probe_pos;
}
if (high_pos < target_pos) {
// results are all too low, so probe_pos..high_pos is too low
// raise the low bound to high_pos but not above upper_bound
const auto next_pos = min(high_pos, upper_bound);
SEEKER_DEBUG_LOG("lower_bound = next_pos " << next_pos);
lower_bound = next_pos;
}
// compute a new probe pos at the middle of the range and try again
// we can't have a zero-size range here because we would not have set found_low yet
THROW_CHECK2(out_of_range, lower_bound, upper_bound, lower_bound <= upper_bound);
const Pos delta = (upper_bound - lower_bound) / 2;
probe_pos = lower_bound + delta;
if (delta < 1) {
// nothing can exist in the range (lower_bound, upper_bound)
// and an object is known to exist at lower_bound
SEEKER_DEBUG_LOG("return: probe_pos == lower_bound " << lower_bound);
return lower_bound;
}
THROW_CHECK2(out_of_range, lower_bound, probe_pos, lower_bound <= probe_pos);
THROW_CHECK2(out_of_range, upper_bound, probe_pos, probe_pos <= upper_bound);
SEEKER_DEBUG_LOG("loop bottom: lower_bound " << lower_bound << ", probe_pos " << probe_pos << ", upper_bound " << upper_bound);
}
THROW_ERROR(runtime_error, "FIXME: should not reach this line: "
"lower_bound..upper_bound " << lower_bound << ".." << upper_bound << ", "
"found_low " << found_low);
}
}
#endif // _CRUCIBLE_SEEKER_H_
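
The Fetch requirements listed in the header comments can be illustrated with a small in-memory stand-in. The sketch below is an assumption-laden example, not code from the repository: `KeySet`, `fetch_range`, and `highest_key_at_or_below` are hypothetical names. `fetch_range` has the shape a Fetch callable is expected to have, and `highest_key_at_or_below` computes the answer that the documented `seek_backward` contract describes (highest key `<= target_pos`, or the minimum Pos value when no such key exists).

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical key store standing in for whatever Fetch really searches.
using KeySet = std::set<uint64_t>;

// A Fetch-shaped function: returns keys in ascending order, all >= lower,
// including upper when it is present. This sketch stops at upper, which the
// requirements permit ("may or may not return objects after that").
std::vector<uint64_t> fetch_range(const KeySet &keys, uint64_t lower, uint64_t upper) {
    std::vector<uint64_t> rv;
    for (auto it = keys.lower_bound(lower); it != keys.end() && *it <= upper; ++it) {
        rv.push_back(*it);
    }
    return rv;
}

// Reference answer for the documented contract:
// highest key <= target_pos, or 0 (the minimum uint64_t) when none exists.
uint64_t highest_key_at_or_below(const KeySet &keys, uint64_t target_pos) {
    auto it = keys.upper_bound(target_pos);
    if (it == keys.begin()) {
        return 0;
    }
    return *std::prev(it);
}
```

With the real header available, a lambda such as `[&](uint64_t lo, uint64_t hi) { return fetch_range(keys, lo, hi); }` could be passed as the `fetch` argument, and the reference function gives a value to compare the result against in a unit test.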


@@ -11,23 +11,6 @@
namespace crucible {
using namespace std;
// Zero-initialize a base class object (usually a C struct)
template <class Base>
void
memset_zero(Base *that)
{
memset(that, 0, sizeof(Base));
}
// Copy a base class object (usually a C struct) into a vector<char>
template <class Base>
vector<char>
vector_copy_struct(Base *that)
{
const char *begin_that = reinterpret_cast<const char *>(static_cast<const Base *>(that));
return vector<char>(begin_that, begin_that + sizeof(Base));
}
// int->hex conversion with sprintf
string to_hex(uint64_t i);
@@ -60,7 +43,7 @@ namespace crucible {
ptrdiff_t
pointer_distance(const P1 *a, const P2 *b)
{
return reinterpret_cast<const char *>(a) - reinterpret_cast<const char *>(b);
return reinterpret_cast<const uint8_t *>(a) - reinterpret_cast<const uint8_t *>(b);
}
};

106
include/crucible/table.h Normal file

@@ -0,0 +1,106 @@
#ifndef CRUCIBLE_TABLE_H
#define CRUCIBLE_TABLE_H
#include <functional>
#include <limits>
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>
namespace crucible {
namespace Table {
using namespace std;
using Content = function<string(size_t width, size_t height)>;
const size_t endpos = numeric_limits<size_t>::max();
Content Fill(const char c);
Content Text(const string& s);
template <class T>
Content Number(const T& num)
{
ostringstream oss;
oss << num;
return Text(oss.str());
}
class Cell {
Content m_content;
public:
Cell(const Content &fn = [](size_t, size_t) { return string(); } );
Cell& operator=(const Content &fn);
string text(size_t width, size_t height) const;
};
class Dimension {
size_t m_next_pos = 0;
vector<size_t> m_elements;
friend class Table;
size_t at(size_t) const;
public:
size_t size() const;
size_t insert(size_t pos);
void erase(size_t pos);
};
class Table {
Dimension m_rows, m_cols;
map<pair<size_t, size_t>, Cell> m_cells;
string m_left = "|";
string m_mid = "|";
string m_right = "|";
public:
Dimension &rows();
const Dimension& rows() const;
Dimension &cols();
const Dimension& cols() const;
Cell& at(size_t row, size_t col);
const Cell& at(size_t row, size_t col) const;
template <class T> void insert_row(size_t pos, const T& container);
template <class T> void insert_col(size_t pos, const T& container);
void left(const string &s);
void mid(const string &s);
void right(const string &s);
const string& left() const;
const string& mid() const;
const string& right() const;
};
ostream& operator<<(ostream &os, const Table &table);
template <class T>
void
Table::insert_row(size_t pos, const T& container)
{
const auto new_pos = m_rows.insert(pos);
size_t col = 0;
for (const auto &i : container) {
if (col >= cols().size()) {
cols().insert(col);
}
at(new_pos, col++) = i;
}
}
template <class T>
void
Table::insert_col(size_t pos, const T& container)
{
const auto new_pos = m_cols.insert(pos);
size_t row = 0;
for (const auto &i : container) {
if (row >= rows().size()) {
rows().insert(row);
}
at(row++, new_pos) = i;
}
}
}
}
#endif // CRUCIBLE_TABLE_H
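
The header only declares `Fill` and `Text`; their definitions live elsewhere and are not shown in this diff. The sketch below shows one plausible way to write callables matching the `Content` signature. The padding and repetition behaviour is an assumption made for illustration, and the `table_sketch` namespace is hypothetical so the names do not collide with the real ones.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

namespace table_sketch {
    // Same shape as the Content alias declared in the header.
    using Content = std::function<std::string(std::size_t width, std::size_t height)>;

    // Assumed behaviour: repeat one character across the cell width.
    // The height argument is ignored in this single-line sketch.
    Content Fill(const char c) {
        return [c](std::size_t width, std::size_t) {
            return std::string(width, c);
        };
    }

    // Assumed behaviour: left-justify the text and pad with spaces up to width.
    // Text longer than width is returned unchanged rather than truncated.
    Content Text(const std::string &s) {
        return [s](std::size_t width, std::size_t) {
            std::string rv = s;
            if (rv.size() < width) {
                rv.append(width - rv.size(), ' ');
            }
            return rv;
        };
    }
}
```

Deferring rendering to a callable lets a cell be laid out only once the final column width is known, which is presumably why `Cell` stores a `Content` rather than a plain string.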

188
include/crucible/task.h Normal file

@@ -0,0 +1,188 @@
#ifndef CRUCIBLE_TASK_H
#define CRUCIBLE_TASK_H
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <ostream>
#include <string>
namespace crucible {
using namespace std;
class TaskState;
using TaskId = uint64_t;
/// A unit of work to be scheduled by TaskMaster.
class Task {
shared_ptr<TaskState> m_task_state;
Task(shared_ptr<TaskState> pts);
public:
/// Create empty Task object.
Task() = default;
/// Create Task object containing closure and description.
Task(string title, function<void()> exec_fn);
/// Schedule Task for at most one future execution.
/// May run Task in current thread or in other thread.
/// May run Task before or after returning.
/// Schedules Task at the end of the global execution queue.
///
/// Only one instance of a Task may execute at a time.
/// If a Task is already scheduled, run() does nothing.
/// If a Task is already running when a new instance reaches
/// the front of the queue, the new instance will execute
/// after the current instance exits.
void run() const;
/// Schedule task to run when no other Task is available.
void idle() const;
/// Schedule Task to run after this Task has run or
/// been destroyed.
void append(const Task &task) const;
/// Schedule Task to run after this Task has run or
/// been destroyed, in Task ID order.
void insert(const Task &task) const;
/// Describe Task as text.
string title() const;
/// Returns currently executing task if called from exec_fn.
/// Usually used to reschedule the currently executing Task.
static Task current_task();
/// Returns number of currently existing Task objects.
/// Good for spotting leaks.
static size_t instance_count();
/// Ordering operator for containers
bool operator<(const Task &that) const;
/// Null test
operator bool() const;
/// Unique non-repeating(ish) ID for task
TaskId id() const;
};
ostream &operator<<(ostream &os, const Task &task);
class TaskMaster {
public:
/// Blocks until the running thread count reaches this number
static void set_thread_count(size_t threads);
/// Sets minimum thread count when load average tracking enabled
static void set_thread_min_count(size_t min_threads);
/// Calls set_thread_count with default
static void set_thread_count();
/// Creates thread to track load average and adjust thread count dynamically
static void set_loadavg_target(double target);
/// Writes the current non-executing Task queue
static ostream & print_queue(ostream &);
/// Writes the current executing Task for each worker
static ostream & print_workers(ostream &);
/// Gets the current number of queued Tasks
static size_t get_queue_count();
/// Gets the current number of active workers
static size_t get_thread_count();
/// Gets the current load tracking statistics
struct LoadStats {
/// Current load extracted from last two 5-second load average samples
double current_load;
/// Target thread count computed from previous thread count and current load
double thread_target;
/// Load average for last 60 seconds
double loadavg;
};
static LoadStats get_current_load();
/// Drop the current queue and discard new Tasks without
/// running them. Currently executing tasks are not
/// affected (use set_thread_count(0) to wait for those
/// to complete).
static void cancel();
/// Stop running any new Tasks. All existing
/// Consumer threads will exit. Does not affect queue.
/// Does not wait for threads to exit. Reversible.
static void pause(bool paused = true);
};
class BarrierState;
/// Barrier delays the execution of one or more Tasks.
/// The Tasks are executed when the last shared reference to the
/// BarrierState is released. Copies of Barrier objects refer
/// to the same Barrier state.
class Barrier {
shared_ptr<BarrierState> m_barrier_state;
public:
Barrier();
/// Schedule a task for execution when last Barrier is released.
void insert_task(Task t);
/// Release this reference to the barrier state.
/// Last released reference executes the task.
/// Barrier can only be released once, after which the
/// object can no longer be used.
void release();
};
class ExclusionLock {
shared_ptr<Task> m_owner;
ExclusionLock(shared_ptr<Task> owner);
friend class Exclusion;
public:
/// Explicit default constructor because we have other kinds
ExclusionLock() = default;
/// Release this Lock immediately and permanently
void release();
/// Test for locked state
operator bool() const;
};
class Exclusion {
mutex m_mutex;
weak_ptr<Task> m_owner;
public:
/// Attempt to obtain a Lock. If successful, current Task
/// owns the Lock until the ExclusionLock is released
/// (it is the ExclusionLock that owns the lock, so it can
/// be passed to other Tasks or threads, but this is not
/// recommended practice).
/// If not successful, the argument Task is appended to the
/// task that currently holds the lock. Current task is
/// expected to immediately release any other ExclusionLock
/// objects it holds, and exit its Task function.
ExclusionLock try_lock(const Task &task);
};
/// Wrapper around pthread_setname_np which handles length limits
void pthread_setname(const string &name);
/// Wrapper around pthread_getname_np for symmetry
string pthread_getname();
}
#endif // CRUCIBLE_TASK_H
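
The Barrier comments describe tasks that run when the last shared reference to the barrier state is released. That behaviour can be sketched with a `shared_ptr` whose pointee runs its queued callbacks in its destructor. `BarrierStateSketch` and `BarrierSketch` are hypothetical names, and the sketch runs callbacks inline instead of scheduling Tasks, so it shows the reference-counting idea only and is not the library's implementation. It is also not thread-safe.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for BarrierState: runs queued callbacks on destruction.
struct BarrierStateSketch {
    std::vector<std::function<void()>> m_tasks;
    ~BarrierStateSketch() {
        // Callbacks must not throw, since they run inside a destructor.
        for (auto &fn : m_tasks) {
            fn();
        }
    }
};

// Copies share one state; the callbacks fire when the last copy lets go.
class BarrierSketch {
    std::shared_ptr<BarrierStateSketch> m_state = std::make_shared<BarrierStateSketch>();
public:
    void insert_task(std::function<void()> fn) {
        // Only valid before release(), as with the documented Barrier.
        assert(m_state);
        m_state->m_tasks.push_back(std::move(fn));
    }
    void release() {
        m_state.reset();
    }
};
```

Because the trigger is the destruction of the shared state, a copy that goes out of scope without an explicit `release()` has the same effect as releasing it.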


@@ -4,6 +4,8 @@
#include "crucible/error.h"
#include <chrono>
#include <condition_variable>
#include <limits>
#include <mutex>
#include <ostream>
@@ -17,10 +19,9 @@ namespace crucible {
public:
Timer();
double age() const;
chrono::high_resolution_clock::time_point get() const;
double report(int precision = 1000) const;
void reset();
void set(const chrono::high_resolution_clock::time_point &start);
void set(double delta);
double lap();
bool operator<(double d) const;
bool operator>(double d) const;
@@ -32,18 +33,78 @@ namespace crucible {
Timer m_timer;
double m_rate;
double m_burst;
double m_tokens;
mutex m_mutex;
double m_tokens = 0.0;
mutable mutex m_mutex;
void update_tokens();
RateLimiter() = delete;
public:
RateLimiter(double rate, double burst);
RateLimiter(double rate);
void sleep_for(double cost = 1.0);
double sleep_time(double cost = 1.0);
bool is_ready();
void borrow(double cost = 1.0);
void rate(double new_rate);
double rate() const;
};
class RateEstimator {
mutable mutex m_mutex;
mutable condition_variable m_condvar;
Timer m_timer;
double m_num = 0.0;
double m_den = 0.0;
uint64_t m_last_count = numeric_limits<uint64_t>::max();
Timer m_last_update;
const double m_decay = 0.99;
Timer m_last_decay;
double m_min_delay;
double m_max_delay;
chrono::duration<double> duration_unlocked(uint64_t relative_count) const;
chrono::high_resolution_clock::time_point time_point_unlocked(uint64_t absolute_count) const;
double rate_unlocked() const;
pair<double, double> ratio_unlocked() const;
void update_unlocked(uint64_t new_count);
public:
RateEstimator(double min_delay = 1, double max_delay = 3600);
// Block until count reached
void wait_for(uint64_t new_count_relative) const;
void wait_until(uint64_t new_count_absolute) const;
// Computed rates and ratios
double rate() const;
pair<double, double> ratio() const;
// Inspect raw num/den
pair<double, double> raw() const;
// Write count
void update(uint64_t new_count);
// Ignore counts that go backwards
void update_monotonic(uint64_t new_count);
// Read count
uint64_t count() const;
/// Increment count (like update(count() + more), but atomic)
void increment(uint64_t more = 1);
// Convert counts to chrono types
chrono::high_resolution_clock::time_point time_point(uint64_t absolute_count) const;
chrono::duration<double> duration(uint64_t relative_count) const;
// Polling delay until count reached (limited by min/max delay)
double seconds_for(uint64_t new_count_relative) const;
double seconds_until(uint64_t new_count_absolute) const;
};
ostream &
operator<<(ostream &os, const RateEstimator &re);
}
#endif // CRUCIBLE_TIME_H
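
The RateLimiter members (`m_rate`, `m_burst`, `m_tokens`) suggest a token-bucket design. The sketch below is a generic token bucket with an explicit `advance` call in place of a real clock, so it can be checked deterministically. `TokenBucketSketch` and `advance` are hypothetical, and the arithmetic is an assumption about how such a limiter typically works, not a copy of crucible's implementation.

```cpp
#include <algorithm>
#include <cassert>

// Generic token bucket: tokens accrue at m_rate per second, capped at m_burst.
class TokenBucketSketch {
    double m_rate;
    double m_burst;
    double m_tokens = 0.0;
public:
    TokenBucketSketch(double rate, double burst) : m_rate(rate), m_burst(burst) {
        assert(rate > 0.0);
    }

    // Credit tokens for elapsed time (stands in for reading a Timer).
    void advance(double seconds) {
        m_tokens = std::min(m_burst, m_tokens + seconds * m_rate);
    }

    // Spend tokens now; the balance may go negative.
    void borrow(double cost = 1.0) {
        m_tokens -= cost;
    }

    // True when the balance is not in debt.
    bool is_ready() const {
        return m_tokens >= 0.0;
    }

    // Spend cost, then report how many seconds of accrual repay any debt.
    double sleep_time(double cost = 1.0) {
        borrow(cost);
        return m_tokens >= 0.0 ? 0.0 : -m_tokens / m_rate;
    }
};
```

Allowing the balance to go negative means a caller is charged immediately and waits afterwards, which keeps the long-run rate bounded even when individual costs exceed the burst size.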


@@ -1,188 +0,0 @@
#ifndef CRUCIBLE_TIMEQUEUE_H
#define CRUCIBLE_TIMEQUEUE_H
#include <crucible/error.h>
#include <crucible/time.h>
#include <condition_variable>
#include <limits>
#include <list>
#include <memory>
#include <mutex>
#include <set>
namespace crucible {
using namespace std;
template <class Task>
class TimeQueue {
public:
using Timestamp = chrono::high_resolution_clock::time_point;
private:
struct Item {
Timestamp m_time;
unsigned m_id;
Task m_task;
bool operator<(const Item &that) const {
if (m_time < that.m_time) return true;
if (that.m_time < m_time) return false;
return m_id < that.m_id;
}
static unsigned s_id;
Item(const Timestamp &time, const Task& task) :
m_time(time),
m_id(++s_id),
m_task(task)
{
}
};
set<Item> m_set;
mutable mutex m_mutex;
condition_variable m_cond_full, m_cond_empty;
size_t m_max_queue_depth;
public:
~TimeQueue();
TimeQueue(size_t max_queue_depth = numeric_limits<size_t>::max());
void push(const Task &task, double delay = 0);
void push_nowait(const Task &task, double delay = 0);
Task pop();
bool pop_nowait(Task &t);
double when() const;
size_t size() const;
bool empty() const;
list<Task> peek(size_t count) const;
};
template <class Task> unsigned TimeQueue<Task>::Item::s_id = 0;
template <class Task>
TimeQueue<Task>::~TimeQueue()
{
if (!m_set.empty()) {
cerr << "ERROR: " << m_set.size() << " locked items still in TimeQueue at destruction" << endl;
}
}
template <class Task>
void
TimeQueue<Task>::push(const Task &task, double delay)
{
Timestamp time = chrono::high_resolution_clock::now() +
chrono::duration_cast<chrono::high_resolution_clock::duration>(chrono::duration<double>(delay));
unique_lock<mutex> lock(m_mutex);
while (m_set.size() > m_max_queue_depth) {
m_cond_full.wait(lock);
}
m_set.insert(Item(time, task));
m_cond_empty.notify_all();
}
template <class Task>
void
TimeQueue<Task>::push_nowait(const Task &task, double delay)
{
Timestamp time = chrono::high_resolution_clock::now() +
chrono::duration_cast<chrono::high_resolution_clock::duration>(chrono::duration<double>(delay));
unique_lock<mutex> lock(m_mutex);
m_set.insert(Item(time, task));
m_cond_empty.notify_all();
}
template <class Task>
Task
TimeQueue<Task>::pop()
{
unique_lock<mutex> lock(m_mutex);
while (1) {
while (m_set.empty()) {
m_cond_empty.wait(lock);
}
Timestamp now = chrono::high_resolution_clock::now();
if (now > m_set.begin()->m_time) {
Task rv = m_set.begin()->m_task;
m_set.erase(m_set.begin());
m_cond_full.notify_all();
return rv;
}
m_cond_empty.wait_until(lock, m_set.begin()->m_time);
}
}
template <class Task>
bool
TimeQueue<Task>::pop_nowait(Task &t)
{
unique_lock<mutex> lock(m_mutex);
if (m_set.empty()) {
return false;
}
Timestamp now = chrono::high_resolution_clock::now();
if (now <= m_set.begin()->m_time) {
return false;
}
t = m_set.begin()->m_task;
m_set.erase(m_set.begin());
m_cond_full.notify_all();
return true;
}
template <class Task>
double
TimeQueue<Task>::when() const
{
unique_lock<mutex> lock(m_mutex);
if (m_set.empty()) {
return numeric_limits<double>::infinity();
}
return chrono::duration<double>(m_set.begin()->m_time - chrono::high_resolution_clock::now()).count();
}
template <class Task>
size_t
TimeQueue<Task>::size() const
{
unique_lock<mutex> lock(m_mutex);
return m_set.size();
}
template <class Task>
bool
TimeQueue<Task>::empty() const
{
unique_lock<mutex> lock(m_mutex);
return m_set.empty();
}
template <class Task>
list<Task>
TimeQueue<Task>::peek(size_t count) const
{
unique_lock<mutex> lock(m_mutex);
list<Task> rv;
auto it = m_set.begin();
while (count-- && it != m_set.end()) {
rv.push_back(it->m_task);
++it;
}
return rv;
}
template <class Task>
TimeQueue<Task>::TimeQueue(size_t max_depth) :
m_max_queue_depth(max_depth)
{
}
}
#endif // CRUCIBLE_TIMEQUEUE_H

14
include/crucible/uname.h Normal file

@@ -0,0 +1,14 @@
#ifndef CRUCIBLE_UNAME_H
#define CRUCIBLE_UNAME_H
#include <sys/utsname.h>
namespace crucible {
using namespace std;
struct Uname : public utsname {
Uname();
};
}
#endif


@@ -1,14 +0,0 @@
#ifndef CRUCIBLE_UUID_H
#define CRUCIBLE_UUID_H
#include <string>
#include <uuid/uuid.h>
namespace crucible {
using namespace std;
string uuid_unparse(const unsigned char a[16]);
}
#endif // CRUCIBLE_UUID_H


@@ -0,0 +1,8 @@
#ifndef CRUCIBLE_VERSION_H
#define CRUCIBLE_VERSION_H
namespace crucible {
extern const char *VERSION;
}
#endif // CRUCIBLE_VERSION_H


@@ -1,189 +0,0 @@
#ifndef CRUCIBLE_WORKQUEUE_H
#define CRUCIBLE_WORKQUEUE_H
#include <crucible/error.h>
#include <condition_variable>
#include <limits>
#include <list>
#include <memory>
#include <mutex>
#include <set>
namespace crucible {
using namespace std;
template <class Task>
class WorkQueue {
public:
using set_type = set<Task>;
using key_type = Task;
private:
set_type m_set;
mutable mutex m_mutex;
condition_variable m_cond_full, m_cond_empty;
size_t m_max_queue_depth;
public:
~WorkQueue();
template <class... Args> WorkQueue(size_t max_queue_depth, Args... args);
template <class... Args> WorkQueue(Args... args);
void push(const key_type &name);
void push_wait(const key_type &name, size_t limit);
void push_nowait(const key_type &name);
key_type pop();
bool pop_nowait(key_type &rv);
key_type peek();
size_t size() const;
bool empty();
set_type copy();
list<Task> peek(size_t count) const;
};
template <class Task>
WorkQueue<Task>::~WorkQueue()
{
if (!m_set.empty()) {
cerr << "ERROR: " << m_set.size() << " locked items still in WorkQueue " << this << " at destruction" << endl;
}
}
template <class Task>
void
WorkQueue<Task>::push(const key_type &name)
{
unique_lock<mutex> lock(m_mutex);
while (!m_set.count(name) && m_set.size() > m_max_queue_depth) {
m_cond_full.wait(lock);
}
m_set.insert(name);
m_cond_empty.notify_all();
}
template <class Task>
void
WorkQueue<Task>::push_wait(const key_type &name, size_t limit)
{
unique_lock<mutex> lock(m_mutex);
while (!m_set.count(name) && m_set.size() >= limit) {
m_cond_full.wait(lock);
}
m_set.insert(name);
m_cond_empty.notify_all();
}
template <class Task>
void
WorkQueue<Task>::push_nowait(const key_type &name)
{
unique_lock<mutex> lock(m_mutex);
m_set.insert(name);
m_cond_empty.notify_all();
}
template <class Task>
typename WorkQueue<Task>::key_type
WorkQueue<Task>::pop()
{
unique_lock<mutex> lock(m_mutex);
while (m_set.empty()) {
m_cond_empty.wait(lock);
}
key_type rv = *m_set.begin();
m_set.erase(m_set.begin());
m_cond_full.notify_all();
return rv;
}
template <class Task>
bool
WorkQueue<Task>::pop_nowait(key_type &rv)
{
unique_lock<mutex> lock(m_mutex);
if (m_set.empty()) {
return false;
}
rv = *m_set.begin();
m_set.erase(m_set.begin());
m_cond_full.notify_all();
return true;
}
template <class Task>
typename WorkQueue<Task>::key_type
WorkQueue<Task>::peek()
{
unique_lock<mutex> lock(m_mutex);
if (m_set.empty()) {
return key_type();
} else {
return *m_set.begin();
}
}
template <class Task>
size_t
WorkQueue<Task>::size() const
{
unique_lock<mutex> lock(m_mutex);
return m_set.size();
}
template <class Task>
bool
WorkQueue<Task>::empty()
{
unique_lock<mutex> lock(m_mutex);
return m_set.empty();
}
template <class Task>
typename WorkQueue<Task>::set_type
WorkQueue<Task>::copy()
{
unique_lock<mutex> lock(m_mutex);
return m_set;
}
template <class Task>
list<Task>
WorkQueue<Task>::peek(size_t count) const
{
unique_lock<mutex> lock(m_mutex);
list<Task> rv;
for (auto i : m_set) {
if (count--) {
rv.push_back(i);
} else {
break;
}
}
return rv;
}
template <class Task>
template <class... Args>
WorkQueue<Task>::WorkQueue(Args... args) :
m_set(args...),
m_max_queue_depth(numeric_limits<size_t>::max())
{
}
template <class Task>
template <class... Args>
WorkQueue<Task>::WorkQueue(size_t max_depth, Args... args) :
m_set(args...),
m_max_queue_depth(max_depth)
{
}
}
#endif // CRUCIBLE_WORKQUEUE_H
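
The removed WorkQueue header above pairs a `std::set` with two condition variables to get a bounded, de-duplicating, blocking queue. The following is a minimal sketch of that pattern, not the crucible implementation: `MiniWorkQueue` and its reduced method set are hypothetical names chosen for illustration, and the sketch assumes `Task` is copyable and ordered by `operator<`.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>

// Hypothetical, simplified stand-in for crucible::WorkQueue<Task>.
template <class Task>
class MiniWorkQueue {
    std::set<Task> m_set;
    mutable std::mutex m_mutex;
    std::condition_variable m_cond_full, m_cond_empty;
public:
    // Block while the set already holds `limit` or more items, unless
    // `name` is already queued (re-inserting it cannot grow the set).
    void push_wait(const Task &name, std::size_t limit) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond_full.wait(lock, [&] {
            return m_set.count(name) || m_set.size() < limit;
        });
        m_set.insert(name);
        m_cond_empty.notify_all();
    }
    // Insert without ever blocking, ignoring any depth limit.
    void push_nowait(const Task &name) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_set.insert(name);
        m_cond_empty.notify_all();
    }
    // Block until the set is non-empty, then remove and return the
    // smallest queued item.
    Task pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond_empty.wait(lock, [&] { return !m_set.empty(); });
        Task rv = *m_set.begin();
        m_set.erase(m_set.begin());
        m_cond_full.notify_all();
        return rv;
    }
    // Non-blocking pop: returns false if nothing is queued.
    bool pop_nowait(Task &rv) {
        std::unique_lock<std::mutex> lock(m_mutex);
        if (m_set.empty()) {
            return false;
        }
        rv = *m_set.begin();
        m_set.erase(m_set.begin());
        m_cond_full.notify_all();
        return true;
    }
    std::size_t size() const {
        std::unique_lock<std::mutex> lock(m_mutex);
        return m_set.size();
    }
};
```

Because the container is a set, pushing a key that is already queued is a no-op, and `pop()` returns the smallest queued key rather than the oldest one.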

lib/.gitignore vendored Normal file

@@ -0,0 +1 @@
.version.*


@@ -1,37 +1,45 @@
default: libcrucible.so
default: libcrucible.a
%.a: Makefile
OBJS = \
crc64.o \
CRUCIBLE_OBJS = \
bytevector.o \
btrfs-tree.o \
chatter.o \
city.o \
cleanup.o \
crc64.o \
error.o \
execpipe.o \
extentwalker.o \
fd.o \
fs.o \
interp.o \
multilock.o \
ntoa.o \
openat2.o \
path.o \
process.o \
seeker.o \
string.o \
table.o \
task.o \
time.o \
uuid.o \
uname.o \
include ../makeflags
-include ../localconf
include ../Defines.mk
LDFLAGS = -shared -luuid
BEES_LDFLAGS = $(LDFLAGS)
depends.mk: *.c *.cc
for x in *.c; do $(CC) $(CFLAGS) -M "$$x"; done > depends.mk.new
for x in *.cc; do $(CXX) $(CXXFLAGS) -M "$$x"; done >> depends.mk.new
mv -fv depends.mk.new depends.mk
configure.h: configure.h.in
$(TEMPLATE_COMPILER)
-include depends.mk
%.dep: %.cc configure.h Makefile
$(CXX) $(BEES_CXXFLAGS) -M -MF $@ -MT $(<:.cc=.o) $<
%.o: %.c
$(CC) $(CFLAGS) -o $@ -c $<
include $(CRUCIBLE_OBJS:%.o=%.dep)
%.o: %.cc ../include/crucible/%.h
$(CXX) $(CXXFLAGS) -o $@ -c $<
%.o: %.cc ../makeflags
$(CXX) $(BEES_CXXFLAGS) -o $@ -c $<
libcrucible.so: $(OBJS) Makefile
$(CXX) $(LDFLAGS) -o $@ $(OBJS)
libcrucible.a: $(CRUCIBLE_OBJS)
$(AR) rcs $@ $^

lib/btrfs-tree.cc Normal file

@@ -0,0 +1,786 @@
#include "crucible/btrfs-tree.h"
#include "crucible/btrfs.h"
#include "crucible/error.h"
#include "crucible/fs.h"
#include "crucible/hexdump.h"
#include "crucible/seeker.h"
#define CRUCIBLE_BTRFS_TREE_DEBUG(x) do { \
if (BtrfsIoctlSearchKey::s_debug_ostream) { \
(*BtrfsIoctlSearchKey::s_debug_ostream) << x; \
} \
} while (false)
namespace crucible {
using namespace std;
uint64_t
BtrfsTreeItem::extent_begin() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
return m_objectid;
}
uint64_t
BtrfsTreeItem::extent_end() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
return m_objectid + m_offset;
}
uint64_t
BtrfsTreeItem::extent_flags() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
return btrfs_get_member(&btrfs_extent_item::flags, m_data);
}
uint64_t
BtrfsTreeItem::extent_generation() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_ITEM_KEY);
return btrfs_get_member(&btrfs_extent_item::generation, m_data);
}
uint64_t
BtrfsTreeItem::root_ref_dirid() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_BACKREF_KEY);
return btrfs_get_member(&btrfs_root_ref::dirid, m_data);
}
string
BtrfsTreeItem::root_ref_name() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_BACKREF_KEY);
const auto name_len = btrfs_get_member(&btrfs_root_ref::name_len, m_data);
const auto name_start = sizeof(struct btrfs_root_ref);
const auto name_end = name_len + name_start;
THROW_CHECK2(runtime_error, m_data.size(), name_end, m_data.size() >= name_end);
return string(m_data.data() + name_start, m_data.data() + name_end);
}
uint64_t
BtrfsTreeItem::root_ref_parent_rootid() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_BACKREF_KEY);
return offset();
}
uint64_t
BtrfsTreeItem::root_flags() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_ITEM_KEY);
return btrfs_get_member(&btrfs_root_item::flags, m_data);
}
uint64_t
BtrfsTreeItem::root_refs() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_ROOT_ITEM_KEY);
return btrfs_get_member(&btrfs_root_item::refs, m_data);
}
ostream &
operator<<(ostream &os, const BtrfsTreeItem &bti)
{
os << "BtrfsTreeItem {"
<< " objectid = " << to_hex(bti.objectid())
<< ", type = " << btrfs_search_type_ntoa(bti.type())
<< ", offset = " << to_hex(bti.offset())
<< ", transid = " << bti.transid()
<< ", data = ";
hexdump(os, bti.data());
return os;
}
uint64_t
BtrfsTreeItem::block_group_flags() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_BLOCK_GROUP_ITEM_KEY);
return btrfs_get_member(&btrfs_block_group_item::flags, m_data);
}
uint64_t
BtrfsTreeItem::block_group_used() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_BLOCK_GROUP_ITEM_KEY);
return btrfs_get_member(&btrfs_block_group_item::used, m_data);
}
uint64_t
BtrfsTreeItem::chunk_length() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_CHUNK_ITEM_KEY);
return btrfs_get_member(&btrfs_chunk::length, m_data);
}
uint64_t
BtrfsTreeItem::chunk_type() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_CHUNK_ITEM_KEY);
return btrfs_get_member(&btrfs_chunk::type, m_data);
}
uint64_t
BtrfsTreeItem::dev_extent_chunk_offset() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_EXTENT_KEY);
return btrfs_get_member(&btrfs_dev_extent::chunk_offset, m_data);
}
uint64_t
BtrfsTreeItem::dev_extent_length() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_EXTENT_KEY);
return btrfs_get_member(&btrfs_dev_extent::length, m_data);
}
uint64_t
BtrfsTreeItem::dev_item_total_bytes() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_ITEM_KEY);
return btrfs_get_member(&btrfs_dev_item::total_bytes, m_data);
}
uint64_t
BtrfsTreeItem::dev_item_bytes_used() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_DEV_ITEM_KEY);
return btrfs_get_member(&btrfs_dev_item::bytes_used, m_data);
}
uint64_t
BtrfsTreeItem::inode_size() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_INODE_ITEM_KEY);
return btrfs_get_member(&btrfs_inode_item::size, m_data);
}
uint64_t
BtrfsTreeItem::file_extent_logical_bytes() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
const auto file_extent_item_type = btrfs_get_member(&btrfs_file_extent_item::type, m_data);
switch (file_extent_item_type) {
case BTRFS_FILE_EXTENT_INLINE:
return btrfs_get_member(&btrfs_file_extent_item::ram_bytes, m_data);
case BTRFS_FILE_EXTENT_PREALLOC:
case BTRFS_FILE_EXTENT_REG:
return btrfs_get_member(&btrfs_file_extent_item::num_bytes, m_data);
default:
THROW_ERROR(runtime_error, "unknown btrfs_file_extent_item type " << file_extent_item_type);
}
}
uint64_t
BtrfsTreeItem::file_extent_offset() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
const auto file_extent_item_type = btrfs_get_member(&btrfs_file_extent_item::type, m_data);
switch (file_extent_item_type) {
case BTRFS_FILE_EXTENT_INLINE:
THROW_ERROR(invalid_argument, "extent is inline " << *this);
case BTRFS_FILE_EXTENT_PREALLOC:
case BTRFS_FILE_EXTENT_REG:
return btrfs_get_member(&btrfs_file_extent_item::offset, m_data);
default:
THROW_ERROR(runtime_error, "unknown btrfs_file_extent_item type " << file_extent_item_type << " in " << *this);
}
}
uint64_t
BtrfsTreeItem::file_extent_generation() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
return btrfs_get_member(&btrfs_file_extent_item::generation, m_data);
}
uint64_t
BtrfsTreeItem::file_extent_bytenr() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
auto file_extent_item_type = btrfs_get_member(&btrfs_file_extent_item::type, m_data);
switch (file_extent_item_type) {
case BTRFS_FILE_EXTENT_INLINE:
THROW_ERROR(invalid_argument, "extent is inline " << *this);
case BTRFS_FILE_EXTENT_PREALLOC:
case BTRFS_FILE_EXTENT_REG:
return btrfs_get_member(&btrfs_file_extent_item::disk_bytenr, m_data);
default:
THROW_ERROR(runtime_error, "unknown btrfs_file_extent_item type " << file_extent_item_type << " in " << *this);
}
}
uint8_t
BtrfsTreeItem::file_extent_type() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
return btrfs_get_member(&btrfs_file_extent_item::type, m_data);
}
btrfs_compression_type
BtrfsTreeItem::file_extent_compression() const
{
THROW_CHECK1(invalid_argument, btrfs_search_type_ntoa(m_type), m_type == BTRFS_EXTENT_DATA_KEY);
return static_cast<btrfs_compression_type>(btrfs_get_member(&btrfs_file_extent_item::compression, m_data));
}
BtrfsTreeItem::BtrfsTreeItem(const BtrfsIoctlSearchHeader &bish) :
m_objectid(bish.objectid),
m_offset(bish.offset),
m_transid(bish.transid),
m_data(bish.m_data),
m_type(bish.type)
{
}
BtrfsTreeItem &
BtrfsTreeItem::operator=(const BtrfsIoctlSearchHeader &bish)
{
m_objectid = bish.objectid;
m_offset = bish.offset;
m_transid = bish.transid;
m_data = bish.m_data;
m_type = bish.type;
return *this;
}
bool
BtrfsTreeItem::operator!() const
{
return m_transid == 0 && m_objectid == 0 && m_offset == 0 && m_type == 0;
}
uint64_t
BtrfsTreeFetcher::block_size() const
{
return m_block_size;
}
BtrfsTreeFetcher::BtrfsTreeFetcher(Fd new_fd) :
m_fd(new_fd)
{
BtrfsIoctlFsInfoArgs bifia;
bifia.do_ioctl(fd());
m_block_size = bifia.sectorsize;
THROW_CHECK1(runtime_error, m_block_size, m_block_size > 0);
// We don't believe sector sizes that aren't multiples of 4K
THROW_CHECK1(runtime_error, m_block_size, (m_block_size % 4096) == 0);
m_lookbehind_size = 128 * 1024;
m_scale_size = m_block_size;
}
Fd
BtrfsTreeFetcher::fd() const
{
return m_fd;
}
void
BtrfsTreeFetcher::fd(Fd fd)
{
m_fd = fd;
}
void
BtrfsTreeFetcher::type(uint8_t type)
{
m_type = type;
}
uint8_t
BtrfsTreeFetcher::type()
{
return m_type;
}
void
BtrfsTreeFetcher::tree(uint64_t tree)
{
m_tree = tree;
}
uint64_t
BtrfsTreeFetcher::tree()
{
return m_tree;
}
void
BtrfsTreeFetcher::transid(uint64_t min_transid, uint64_t max_transid)
{
m_min_transid = min_transid;
m_max_transid = max_transid;
}
uint64_t
BtrfsTreeFetcher::lookbehind_size() const
{
return m_lookbehind_size;
}
void
BtrfsTreeFetcher::lookbehind_size(uint64_t lookbehind_size)
{
m_lookbehind_size = lookbehind_size;
}
uint64_t
BtrfsTreeFetcher::scale_size() const
{
return m_scale_size;
}
void
BtrfsTreeFetcher::scale_size(uint64_t scale_size)
{
m_scale_size = scale_size;
}
void
BtrfsTreeFetcher::fill_sk(BtrfsIoctlSearchKey &sk, uint64_t object)
{
(void)object;
// btrfs allows tree ID 0 meaning the current tree, but we do not.
THROW_CHECK0(invalid_argument, m_tree != 0);
sk.tree_id = m_tree;
sk.min_type = m_type;
sk.max_type = m_type;
sk.min_transid = m_min_transid;
sk.max_transid = m_max_transid;
sk.nr_items = 1;
}
void
BtrfsTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
{
key.next_min(hdr, m_type);
}
BtrfsTreeItem
BtrfsTreeFetcher::at(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("at " << logical);
BtrfsIoctlSearchKey &sk = m_sk;
fill_sk(sk, logical);
// Exact match, should return 0 or 1 items
sk.max_type = sk.min_type;
sk.nr_items = 1;
sk.do_ioctl(fd());
THROW_CHECK1(runtime_error, sk.m_result.size(), sk.m_result.size() < 2);
for (const auto &i : sk.m_result) {
if (hdr_logical(i) == logical && hdr_match(i)) {
return i;
}
}
return BtrfsTreeItem();
}
uint64_t
BtrfsTreeFetcher::scale_logical(const uint64_t logical) const
{
THROW_CHECK1(invalid_argument, logical, (logical % m_scale_size) == 0 || logical == s_max_logical);
return logical / m_scale_size;
}
uint64_t
BtrfsTreeFetcher::scaled_max_logical() const
{
return scale_logical(s_max_logical);
}
uint64_t
BtrfsTreeFetcher::unscale_logical(const uint64_t logical) const
{
THROW_CHECK1(invalid_argument, logical, logical <= scaled_max_logical());
if (logical == scaled_max_logical()) {
return s_max_logical;
}
return logical * scale_size();
}
BtrfsTreeItem
BtrfsTreeFetcher::rlower_bound(uint64_t logical)
{
#if 0
static bool btfrlb_debug = getenv("BTFRLB_DEBUG");
#define BTFRLB_DEBUG(x) do { if (btfrlb_debug) cerr << x; } while (false)
#else
#define BTFRLB_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
#endif
BtrfsTreeItem closest_item;
uint64_t closest_logical = 0;
BtrfsIoctlSearchKey &sk = m_sk;
size_t loops = 0;
BTFRLB_DEBUG("rlower_bound: " << to_hex(logical) << " in tree " << tree() << endl);
seek_backward(scale_logical(logical), [&](uint64_t const lower_bound, uint64_t const upper_bound) {
++loops;
fill_sk(sk, unscale_logical(min(scaled_max_logical(), lower_bound)));
set<uint64_t> rv;
bool too_far = false;
do {
sk.nr_items = 4;
sk.do_ioctl(fd());
BTFRLB_DEBUG("fetch: loop " << loops << " lower_bound..upper_bound " << to_hex(lower_bound) << ".." << to_hex(upper_bound));
for (auto &i : sk.m_result) {
next_sk(sk, i);
// If hdr_stop or !hdr_match, don't inspect the item
if (hdr_stop(i)) {
too_far = true;
rv.insert(numeric_limits<uint64_t>::max());
BTFRLB_DEBUG("(stop)");
break;
}
if (!hdr_match(i)) {
BTFRLB_DEBUG("(no match)");
continue;
}
const auto this_logical = hdr_logical(i);
BTFRLB_DEBUG(" " << to_hex(this_logical) << " " << i);
const auto scaled_hdr_logical = scale_logical(this_logical);
BTFRLB_DEBUG(" " << "(match)");
if (scaled_hdr_logical > upper_bound) {
too_far = true;
BTFRLB_DEBUG("(" << to_hex(scaled_hdr_logical) << " > " << to_hex(upper_bound) << ")");
break;
}
if (this_logical <= logical && this_logical > closest_logical) {
closest_logical = this_logical;
closest_item = i;
BTFRLB_DEBUG("(closest)");
}
rv.insert(scaled_hdr_logical);
BTFRLB_DEBUG("(cont'd)");
}
BTFRLB_DEBUG(endl);
// We might get a search result that contains only non-matching items.
// Keep looping until we find any matching item or we run out of tree.
} while (!too_far && rv.empty() && !sk.m_result.empty());
return rv;
}, scale_logical(lookbehind_size()));
return closest_item;
#undef BTFRLB_DEBUG
}
BtrfsTreeItem
BtrfsTreeFetcher::lower_bound(uint64_t logical)
{
BtrfsIoctlSearchKey &sk = m_sk;
fill_sk(sk, logical);
do {
assert(sk.max_offset == s_max_logical);
sk.do_ioctl(fd());
for (const auto &i : sk.m_result) {
if (hdr_match(i)) {
return i;
}
if (hdr_stop(i)) {
return BtrfsTreeItem();
}
next_sk(sk, i);
}
} while (!sk.m_result.empty());
return BtrfsTreeItem();
}
BtrfsTreeItem
BtrfsTreeFetcher::next(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("next " << logical);
const auto scaled_logical = scale_logical(logical);
if (scaled_logical + 1 > scaled_max_logical()) {
return BtrfsTreeItem();
}
return lower_bound(unscale_logical(scaled_logical + 1));
}
BtrfsTreeItem
BtrfsTreeFetcher::prev(uint64_t logical)
{
CRUCIBLE_BTRFS_TREE_DEBUG("prev " << logical);
const auto scaled_logical = scale_logical(logical);
if (scaled_logical < 1) {
return BtrfsTreeItem();
}
return rlower_bound(unscale_logical(scaled_logical - 1));
}
void
BtrfsTreeObjectFetcher::fill_sk(BtrfsIoctlSearchKey &sk, uint64_t object)
{
BtrfsTreeFetcher::fill_sk(sk, object);
sk.min_offset = 0;
sk.max_offset = numeric_limits<decltype(sk.max_offset)>::max();
sk.min_objectid = object;
sk.max_objectid = numeric_limits<decltype(sk.max_objectid)>::max();
}
uint64_t
BtrfsTreeObjectFetcher::hdr_logical(const BtrfsIoctlSearchHeader &hdr)
{
return hdr.objectid;
}
bool
BtrfsTreeObjectFetcher::hdr_match(const BtrfsIoctlSearchHeader &hdr)
{
// If you're calling this method without overriding it, you should have set type first
assert(m_type);
return hdr.type == m_type;
}
bool
BtrfsTreeObjectFetcher::hdr_stop(const BtrfsIoctlSearchHeader &hdr)
{
(void)hdr;
return false;
}
uint64_t
BtrfsTreeOffsetFetcher::hdr_logical(const BtrfsIoctlSearchHeader &hdr)
{
return hdr.offset;
}
bool
BtrfsTreeOffsetFetcher::hdr_match(const BtrfsIoctlSearchHeader &hdr)
{
assert(m_type);
return hdr.type == m_type && hdr.objectid == m_objectid;
}
bool
BtrfsTreeOffsetFetcher::hdr_stop(const BtrfsIoctlSearchHeader &hdr)
{
assert(m_type);
return hdr.objectid > m_objectid || hdr.type > m_type;
}
void
BtrfsTreeOffsetFetcher::objectid(uint64_t objectid)
{
m_objectid = objectid;
}
uint64_t
BtrfsTreeOffsetFetcher::objectid() const
{
return m_objectid;
}
void
BtrfsTreeOffsetFetcher::fill_sk(BtrfsIoctlSearchKey &sk, uint64_t offset)
{
BtrfsTreeFetcher::fill_sk(sk, offset);
sk.min_offset = offset;
sk.max_offset = numeric_limits<decltype(sk.max_offset)>::max();
sk.min_objectid = m_objectid;
sk.max_objectid = m_objectid;
}
void
BtrfsCsumTreeFetcher::get_sums(uint64_t const logical, size_t count, function<void(uint64_t logical, const uint8_t *buf, size_t bytes)> output)
{
#if 0
static bool bctfgs_debug = getenv("BCTFGS_DEBUG");
#define BCTFGS_DEBUG(x) do { if (bctfgs_debug) cerr << x; } while (false)
#else
#define BCTFGS_DEBUG(x) CRUCIBLE_BTRFS_TREE_DEBUG(x)
#endif
const uint64_t logical_end = logical + count * block_size();
BtrfsTreeItem bti = rlower_bound(logical);
size_t __attribute__((unused)) loops = 0;
BCTFGS_DEBUG("get_sums " << to_hex(logical) << ".." << to_hex(logical_end) << endl);
while (!!bti) {
BCTFGS_DEBUG("get_sums[" << loops << "]: " << bti << endl);
++loops;
// Reject wrong type or objectid
THROW_CHECK1(runtime_error, bti.type(), bti.type() == BTRFS_EXTENT_CSUM_KEY);
THROW_CHECK1(runtime_error, bti.objectid(), bti.objectid() == BTRFS_EXTENT_CSUM_OBJECTID);
// Is this object in range?
const uint64_t data_logical = bti.offset();
if (data_logical >= logical_end) {
// csum object is past end of range, we are done
return;
}
// Figure out how long this csum item is in various units
const size_t csum_byte_count = bti.data().size();
THROW_CHECK1(runtime_error, csum_byte_count, (csum_byte_count % m_sum_size) == 0);
THROW_CHECK1(runtime_error, csum_byte_count, csum_byte_count > 0);
const size_t csum_count = csum_byte_count / m_sum_size;
const uint64_t data_byte_count = csum_count * block_size();
const uint64_t data_logical_end = data_logical + data_byte_count;
if (data_logical_end <= logical) {
// too low, look at next item
bti = lower_bound(logical);
continue;
}
// There is some overlap?
const uint64_t overlap_begin = max(logical, data_logical);
const uint64_t overlap_end = min(logical_end, data_logical_end);
THROW_CHECK2(runtime_error, overlap_begin, overlap_end, overlap_begin < overlap_end);
const uint64_t overlap_offset = overlap_begin - data_logical;
THROW_CHECK1(runtime_error, overlap_offset, (overlap_offset % block_size()) == 0);
const uint64_t overlap_index = overlap_offset * m_sum_size / block_size();
const uint64_t overlap_byte_count = overlap_end - overlap_begin;
const uint64_t overlap_csum_byte_count = overlap_byte_count * m_sum_size / block_size();
// Can't be bigger than a btrfs item
THROW_CHECK1(runtime_error, overlap_index, overlap_index < 65536);
THROW_CHECK1(runtime_error, overlap_csum_byte_count, overlap_csum_byte_count < 65536);
// Yes, process the overlap
output(overlap_begin, bti.data().data() + overlap_index, overlap_csum_byte_count);
// Advance
bti = lower_bound(overlap_end);
}
#undef BCTFGS_DEBUG
}
uint32_t
BtrfsCsumTreeFetcher::sum_type() const
{
return m_sum_type;
}
size_t
BtrfsCsumTreeFetcher::sum_size() const
{
return m_sum_size;
}
BtrfsCsumTreeFetcher::BtrfsCsumTreeFetcher(const Fd &new_fd) :
BtrfsTreeOffsetFetcher(new_fd)
{
type(BTRFS_EXTENT_CSUM_KEY);
tree(BTRFS_CSUM_TREE_OBJECTID);
objectid(BTRFS_EXTENT_CSUM_OBJECTID);
BtrfsIoctlFsInfoArgs bifia;
bifia.do_ioctl(fd());
m_sum_type = static_cast<btrfs_compression_type>(bifia.csum_type());
m_sum_size = bifia.csum_size();
if (m_sum_type == BTRFS_CSUM_TYPE_CRC32 && m_sum_size == 0) {
// Older kernel versions don't fill in this field
m_sum_size = 4;
}
THROW_CHECK1(runtime_error, m_sum_size, m_sum_size > 0);
}
BtrfsExtentItemFetcher::BtrfsExtentItemFetcher(const Fd &new_fd) :
BtrfsTreeObjectFetcher(new_fd)
{
tree(BTRFS_EXTENT_TREE_OBJECTID);
type(BTRFS_EXTENT_ITEM_KEY);
}
BtrfsExtentDataFetcher::BtrfsExtentDataFetcher(const Fd &new_fd) :
BtrfsTreeOffsetFetcher(new_fd)
{
type(BTRFS_EXTENT_DATA_KEY);
}
BtrfsInodeFetcher::BtrfsInodeFetcher(const Fd &fd) :
BtrfsTreeObjectFetcher(fd)
{
type(BTRFS_INODE_ITEM_KEY);
scale_size(1);
}
BtrfsTreeItem
BtrfsInodeFetcher::stat(uint64_t subvol, uint64_t inode)
{
tree(subvol);
const auto item = at(inode);
if (!!item) {
THROW_CHECK2(runtime_error, item.objectid(), inode, inode == item.objectid());
THROW_CHECK2(runtime_error, item.type(), BTRFS_INODE_ITEM_KEY, item.type() == BTRFS_INODE_ITEM_KEY);
}
return item;
}
BtrfsRootFetcher::BtrfsRootFetcher(const Fd &fd) :
BtrfsTreeObjectFetcher(fd)
{
tree(BTRFS_ROOT_TREE_OBJECTID);
scale_size(1);
}
BtrfsTreeItem
BtrfsRootFetcher::root(const uint64_t subvol)
{
const auto my_type = BTRFS_ROOT_ITEM_KEY;
type(my_type);
const auto item = at(subvol);
if (!!item) {
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
}
return item;
}
BtrfsTreeItem
BtrfsRootFetcher::root_backref(const uint64_t subvol)
{
const auto my_type = BTRFS_ROOT_BACKREF_KEY;
type(my_type);
const auto item = at(subvol);
if (!!item) {
THROW_CHECK2(runtime_error, item.objectid(), subvol, subvol == item.objectid());
THROW_CHECK2(runtime_error, item.type(), my_type, item.type() == my_type);
}
return item;
}
BtrfsDataExtentTreeFetcher::BtrfsDataExtentTreeFetcher(const Fd &fd) :
BtrfsExtentItemFetcher(fd),
m_chunk_tree(fd)
{
tree(BTRFS_EXTENT_TREE_OBJECTID);
type(BTRFS_EXTENT_ITEM_KEY);
m_chunk_tree.tree(BTRFS_CHUNK_TREE_OBJECTID);
m_chunk_tree.type(BTRFS_CHUNK_ITEM_KEY);
m_chunk_tree.objectid(BTRFS_FIRST_CHUNK_TREE_OBJECTID);
}
void
BtrfsDataExtentTreeFetcher::next_sk(BtrfsIoctlSearchKey &key, const BtrfsIoctlSearchHeader &hdr)
{
key.min_type = key.max_type = type();
key.max_objectid = key.max_offset = numeric_limits<uint64_t>::max();
key.min_offset = 0;
key.min_objectid = hdr.objectid;
const auto step = scale_size();
if (key.min_objectid < numeric_limits<uint64_t>::max() - step) {
key.min_objectid += step;
} else {
key.min_objectid = numeric_limits<uint64_t>::max();
}
// If we have a current block group, check whether the new position is still inside it
if (!!m_current_bg) {
const auto bg_begin = m_current_bg.offset();
const auto bg_end = bg_begin + m_current_bg.chunk_length();
// If we are still in our current block group, return early
if (key.min_objectid >= bg_begin && key.min_objectid < bg_end) return;
}
// We don't have a current block group or we're out of range
// Find the chunk that this bytenr belongs to
m_current_bg = m_chunk_tree.rlower_bound(key.min_objectid);
// Make sure it's a data block group
while (!!m_current_bg) {
// Data block group, stop here
if (m_current_bg.chunk_type() & BTRFS_BLOCK_GROUP_DATA) break;
// Not a data block group, skip to end
key.min_objectid = m_current_bg.offset() + m_current_bg.chunk_length();
m_current_bg = m_chunk_tree.lower_bound(key.min_objectid);
}
if (!m_current_bg) {
// Ran out of data block groups, stop here
return;
}
// Check to see if bytenr is in the current data block group
const auto bg_begin = m_current_bg.offset();
if (key.min_objectid < bg_begin) {
// Move forward to start of data block group
key.min_objectid = bg_begin;
}
}
}
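
The overlap arithmetic in `BtrfsCsumTreeFetcher::get_sums` above can be isolated as a pure function. The sketch below is an illustration under stated assumptions: `CsumOverlap` and `csum_overlap` are hypothetical names that are not part of crucible, every range boundary is assumed to be a multiple of `block_size`, and `sum_size` and `block_size` are assumed to be nonzero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: which part of one csum item covers the request.
struct CsumOverlap {
    uint64_t begin;       // first data byte covered by both ranges
    uint64_t end;         // one past the last data byte covered by both
    uint64_t csum_offset; // byte offset into the item's csum array
    uint64_t csum_bytes;  // number of csum bytes that cover the overlap
    bool valid;           // false when the ranges do not intersect
};

// logical..logical_end is the requested data range; item_logical is the
// csum item's key offset and item_csum_bytes is the size of its payload.
inline CsumOverlap
csum_overlap(uint64_t logical, uint64_t logical_end,
             uint64_t item_logical, std::size_t item_csum_bytes,
             uint64_t block_size, std::size_t sum_size)
{
    // Each csum of sum_size bytes covers one block of data.
    const uint64_t csum_count = item_csum_bytes / sum_size;
    const uint64_t item_logical_end = item_logical + csum_count * block_size;
    const uint64_t begin = std::max(logical, item_logical);
    const uint64_t end = std::min(logical_end, item_logical_end);
    if (begin >= end) {
        return CsumOverlap{0, 0, 0, 0, false};
    }
    const uint64_t blocks_skipped = (begin - item_logical) / block_size;
    const uint64_t blocks_covered = (end - begin) / block_size;
    return CsumOverlap{begin, end,
                       blocks_skipped * sum_size,
                       blocks_covered * sum_size,
                       true};
}
```

With 4 KiB blocks and 4-byte checksums, a request that starts two blocks into an item skips 8 bytes of the item's payload, which matches the `overlap_index` computation in the source.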

lib/bytevector.cc Normal file

@@ -0,0 +1,189 @@
#include "crucible/bytevector.h"
#include "crucible/error.h"
#include "crucible/hexdump.h"
#include "crucible/string.h"
#include <cassert>
namespace crucible {
using namespace std;
ByteVector::iterator
ByteVector::begin() const
{
unique_lock<mutex> lock(m_mutex);
return m_ptr.get();
}
ByteVector::iterator
ByteVector::end() const
{
unique_lock<mutex> lock(m_mutex);
return m_ptr.get() + m_size;
}
size_t
ByteVector::size() const
{
return m_size;
}
bool
ByteVector::empty() const
{
return !m_ptr || !m_size;
}
void
ByteVector::clear()
{
unique_lock<mutex> lock(m_mutex);
m_ptr.reset();
m_size = 0;
}
ByteVector::value_type&
ByteVector::operator[](size_t index) const
{
unique_lock<mutex> lock(m_mutex);
return m_ptr.get()[index];
}
ByteVector::ByteVector(const ByteVector &that)
{
unique_lock<mutex> lock(that.m_mutex);
m_ptr = that.m_ptr;
m_size = that.m_size;
}
ByteVector&
ByteVector::operator=(const ByteVector &that)
{
// If &that == this, there's no need to do anything, but
// especially don't try to lock the same mutex twice.
if (&m_mutex != &that.m_mutex) {
unique_lock<mutex> lock_this(m_mutex, defer_lock);
unique_lock<mutex> lock_that(that.m_mutex, defer_lock);
lock(lock_this, lock_that);
m_ptr = that.m_ptr;
m_size = that.m_size;
}
return *this;
}
ByteVector::ByteVector(const ByteVector &that, size_t start, size_t length)
{
THROW_CHECK0(out_of_range, that.m_ptr);
THROW_CHECK2(out_of_range, start, that.m_size, start <= that.m_size);
THROW_CHECK2(out_of_range, start + length, that.m_size, start + length <= that.m_size);
m_ptr = Pointer(that.m_ptr, that.m_ptr.get() + start);
m_size = length;
}
ByteVector
ByteVector::at(size_t start, size_t length) const
{
return ByteVector(*this, start, length);
}
ByteVector::value_type&
ByteVector::at(size_t size) const
{
unique_lock<mutex> lock(m_mutex);
THROW_CHECK0(out_of_range, m_ptr);
THROW_CHECK2(out_of_range, size, m_size, size < m_size);
return m_ptr.get()[size];
}
static
void *
bv_allocate(size_t size)
{
#ifdef BEES_VALGRIND
// XXX: only do this to shut up valgrind
return calloc(1, size);
#else
return malloc(size);
#endif
}
ByteVector::ByteVector(size_t size)
{
m_ptr = Pointer(static_cast<value_type*>(bv_allocate(size)), free);
// bad_alloc doesn't fit THROW_CHECK's template
THROW_CHECK0(runtime_error, m_ptr);
m_size = size;
}
ByteVector::ByteVector(iterator begin, iterator end, size_t min_size)
{
const size_t size = end - begin;
const size_t alloc_size = max(size, min_size);
m_ptr = Pointer(static_cast<value_type*>(bv_allocate(alloc_size)), free);
THROW_CHECK0(runtime_error, m_ptr);
m_size = alloc_size;
memcpy(m_ptr.get(), begin, size);
}
bool
ByteVector::operator==(const ByteVector &that) const
{
unique_lock<mutex> lock_this(m_mutex, defer_lock);
unique_lock<mutex> lock_that(that.m_mutex, defer_lock);
lock(lock_this, lock_that);
if (!m_ptr) {
return !that.m_ptr;
}
if (!that.m_ptr) {
return false;
}
if (m_size != that.m_size) {
return false;
}
if (m_ptr.get() == that.m_ptr.get()) {
return true;
}
return !memcmp(m_ptr.get(), that.m_ptr.get(), m_size);
}
void
ByteVector::erase(iterator begin, iterator end)
{
unique_lock<mutex> lock(m_mutex);
const size_t size = end - begin;
if (!size) return;
THROW_CHECK0(out_of_range, m_ptr);
const iterator my_begin = m_ptr.get();
const iterator my_end = my_begin + m_size;
THROW_CHECK4(out_of_range, my_begin, begin, my_end, end, my_begin == begin || my_end == end);
if (begin == my_begin) {
if (end == my_end) {
m_size = 0;
m_ptr.reset();
return;
}
m_ptr = Pointer(m_ptr, end);
}
m_size -= size;
}
void
ByteVector::erase(iterator begin)
{
erase(begin, begin + 1);
}
ByteVector::value_type*
ByteVector::data() const
{
unique_lock<mutex> lock(m_mutex);
return m_ptr.get();
}
ostream&
operator<<(ostream &os, const ByteVector &bv) {
hexdump(os, bv);
return os;
}
}
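
The sub-range constructor and `erase` in bytevector.cc above build a new `Pointer` from an existing one plus an offset, which is how a slice can share the original allocation instead of copying it. The sketch below shows that technique with the `std::shared_ptr` aliasing constructor. It assumes `ByteVector::Pointer` is a `std::shared_ptr` of bytes, which this part of the change does not show; `allocate_bytes` and `sub_range` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>

// Assumed shape of ByteVector::Pointer (not confirmed by the diff).
using Pointer = std::shared_ptr<uint8_t>;

// Allocate with malloc and release with free, as bytevector.cc does.
inline Pointer allocate_bytes(std::size_t size)
{
    return Pointer(static_cast<uint8_t *>(std::malloc(size)), std::free);
}

// Aliasing constructor: the result shares ownership of the whole
// allocation with `whole`, but get() points `start` bytes into it.
// The allocation is freed only when the last owner goes away.
inline Pointer sub_range(const Pointer &whole, std::size_t start)
{
    return Pointer(whole, whole.get() + start);
}
```

The slice keeps the full allocation alive even after the original pointer is reset, so a small slice of a large buffer pins the entire buffer in memory.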


@@ -15,8 +15,10 @@
namespace crucible {
using namespace std;
static auto_ptr<set<string>> chatter_names;
static shared_ptr<set<string>> chatter_names;
static const char *SPACETAB = " \t";
static bool add_prefix_timestamp = true;
static bool add_prefix_level = true;
static
void
@@ -43,28 +45,52 @@ namespace crucible {
}
}
Chatter::Chatter(string name, ostream &os)
: m_name(name), m_os(os)
Chatter::Chatter(int loglevel, string name, ostream &os)
: m_loglevel(loglevel), m_name(name), m_os(os)
{
}
void
Chatter::enable_timestamp(bool prefix_timestamp)
{
add_prefix_timestamp = prefix_timestamp;
}
void
Chatter::enable_level(bool prefix_level)
{
add_prefix_level = prefix_level;
}
Chatter::~Chatter()
{
ostringstream header_stream;
time_t ltime;
DIE_IF_MINUS_ONE(time(&ltime));
struct tm ltm;
DIE_IF_ZERO(localtime_r(&ltime, &ltm));
if (add_prefix_timestamp) {
time_t ltime;
DIE_IF_MINUS_ONE(time(&ltime));
struct tm ltm;
DIE_IF_ZERO(localtime_r(&ltime, &ltm));
char buf[1024];
DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));
char buf[1024];
DIE_IF_ZERO(strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", &ltm));
header_stream << buf;
header_stream << " " << getpid() << "." << gettid();
if (!m_name.empty()) {
header_stream << " " << m_name;
header_stream << buf;
header_stream << " " << getpid() << "." << gettid();
if (add_prefix_level) {
header_stream << "<" << m_loglevel << ">";
}
if (!m_name.empty()) {
header_stream << " " << m_name;
}
} else {
if (add_prefix_level) {
header_stream << "<" << m_loglevel << ">";
}
header_stream << (m_name.empty() ? "thread" : m_name);
header_stream << "[" << gettid() << "]";
}
header_stream << ": ";
string out = m_oss.str();
@@ -86,7 +112,7 @@ namespace crucible {
}
Chatter::Chatter(Chatter &&c)
: m_name(c.m_name), m_os(c.m_os), m_oss(c.m_oss.str())
: m_loglevel(c.m_loglevel), m_name(c.m_name), m_os(c.m_os), m_oss(c.m_oss.str())
{
c.m_oss.str("");
}
@@ -110,6 +136,7 @@ namespace crucible {
} else if (!chatter_names->empty()) {
cerr << "CRUCIBLE_CHATTER does not list '" << m_file << "' or '" << m_pretty_function << "'" << endl;
}
(void)m_line; // not implemented yet
// cerr << "ChatterBox " << reinterpret_cast<void*>(this) << " constructed" << endl;
}
@@ -132,7 +159,7 @@ namespace crucible {
ChatterUnwinder::~ChatterUnwinder()
{
if (uncaught_exception()) {
if (current_exception()) {
m_func();
}
}

lib/city.cc Normal file

@@ -0,0 +1,513 @@
// Copyright (c) 2011 Google, Inc.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//
// CityHash, by Geoff Pike and Jyrki Alakuijala
//
// This file provides CityHash64() and related functions.
//
// It's probably possible to create even faster hash functions by
// writing a program that systematically explores some of the space of
// possible hash functions, by using SIMD instructions, or by
// compromising on hash quality.
#include "crucible/city.h"
#include <algorithm>
#include <string.h> // for memcpy and memset
using namespace std;
static uint64 UNALIGNED_LOAD64(const char *p) {
uint64 result;
memcpy(&result, p, sizeof(result));
return result;
}
static uint32 UNALIGNED_LOAD32(const char *p) {
uint32 result;
memcpy(&result, p, sizeof(result));
return result;
}
#ifdef _MSC_VER
#include <stdlib.h>
#define bswap_32(x) _byteswap_ulong(x)
#define bswap_64(x) _byteswap_uint64(x)
#elif defined(__APPLE__)
// Mac OS X / Darwin features
#include <libkern/OSByteOrder.h>
#define bswap_32(x) OSSwapInt32(x)
#define bswap_64(x) OSSwapInt64(x)
#elif defined(__sun) || defined(sun)
#include <sys/byteorder.h>
#define bswap_32(x) BSWAP_32(x)
#define bswap_64(x) BSWAP_64(x)
#elif defined(__FreeBSD__)
#include <sys/endian.h>
#define bswap_32(x) bswap32(x)
#define bswap_64(x) bswap64(x)
#elif defined(__OpenBSD__)
#include <sys/types.h>
#define bswap_32(x) swap32(x)
#define bswap_64(x) swap64(x)
#elif defined(__NetBSD__)
#include <sys/types.h>
#include <machine/bswap.h>
#if defined(__BSWAP_RENAME) && !defined(__bswap_32)
#define bswap_32(x) bswap32(x)
#define bswap_64(x) bswap64(x)
#endif
#else
#include <byteswap.h>
#endif
#ifdef WORDS_BIGENDIAN
#define uint32_in_expected_order(x) (bswap_32(x))
#define uint64_in_expected_order(x) (bswap_64(x))
#else
#define uint32_in_expected_order(x) (x)
#define uint64_in_expected_order(x) (x)
#endif
#if !defined(LIKELY)
#if HAVE_BUILTIN_EXPECT
#define LIKELY(x) (__builtin_expect(!!(x), 1))
#else
#define LIKELY(x) (x)
#endif
#endif
static uint64 Fetch64(const char *p) {
return uint64_in_expected_order(UNALIGNED_LOAD64(p));
}
static uint32 Fetch32(const char *p) {
return uint32_in_expected_order(UNALIGNED_LOAD32(p));
}
// Some primes between 2^63 and 2^64 for various uses.
static const uint64 k0 = 0xc3a5c85c97cb3127ULL;
static const uint64 k1 = 0xb492b66fbe98f273ULL;
static const uint64 k2 = 0x9ae16a3b2f90404fULL;
// Magic numbers for 32-bit hashing. Copied from Murmur3.
static const uint32 c1 = 0xcc9e2d51;
static const uint32 c2 = 0x1b873593;
// A 32-bit to 32-bit integer hash copied from Murmur3.
static uint32 fmix(uint32 h)
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
static uint32 Rotate32(uint32 val, int shift) {
// Avoid shifting by 32: doing so yields an undefined result.
return shift == 0 ? val : ((val >> shift) | (val << (32 - shift)));
}
#undef PERMUTE3
#define PERMUTE3(a, b, c) do { std::swap(a, b); std::swap(a, c); } while (0)
static uint32 Mur(uint32 a, uint32 h) {
// Helper from Murmur3 for combining two 32-bit values.
a *= c1;
a = Rotate32(a, 17);
a *= c2;
h ^= a;
h = Rotate32(h, 19);
return h * 5 + 0xe6546b64;
}
static uint32 Hash32Len13to24(const char *s, size_t len) {
uint32 a = Fetch32(s - 4 + (len >> 1));
uint32 b = Fetch32(s + 4);
uint32 c = Fetch32(s + len - 8);
uint32 d = Fetch32(s + (len >> 1));
uint32 e = Fetch32(s);
uint32 f = Fetch32(s + len - 4);
uint32 h = len;
return fmix(Mur(f, Mur(e, Mur(d, Mur(c, Mur(b, Mur(a, h)))))));
}
static uint32 Hash32Len0to4(const char *s, size_t len) {
uint32 b = 0;
uint32 c = 9;
for (size_t i = 0; i < len; i++) {
signed char v = s[i];
b = b * c1 + v;
c ^= b;
}
return fmix(Mur(b, Mur(len, c)));
}
static uint32 Hash32Len5to12(const char *s, size_t len) {
uint32 a = len, b = len * 5, c = 9, d = b;
a += Fetch32(s);
b += Fetch32(s + len - 4);
c += Fetch32(s + ((len >> 1) & 4));
return fmix(Mur(c, Mur(b, Mur(a, d))));
}
uint32 CityHash32(const char *s, size_t len) {
if (len <= 24) {
return len <= 12 ?
(len <= 4 ? Hash32Len0to4(s, len) : Hash32Len5to12(s, len)) :
Hash32Len13to24(s, len);
}
// len > 24
uint32 h = len, g = c1 * len, f = g;
uint32 a0 = Rotate32(Fetch32(s + len - 4) * c1, 17) * c2;
uint32 a1 = Rotate32(Fetch32(s + len - 8) * c1, 17) * c2;
uint32 a2 = Rotate32(Fetch32(s + len - 16) * c1, 17) * c2;
uint32 a3 = Rotate32(Fetch32(s + len - 12) * c1, 17) * c2;
uint32 a4 = Rotate32(Fetch32(s + len - 20) * c1, 17) * c2;
h ^= a0;
h = Rotate32(h, 19);
h = h * 5 + 0xe6546b64;
h ^= a2;
h = Rotate32(h, 19);
h = h * 5 + 0xe6546b64;
g ^= a1;
g = Rotate32(g, 19);
g = g * 5 + 0xe6546b64;
g ^= a3;
g = Rotate32(g, 19);
g = g * 5 + 0xe6546b64;
f += a4;
f = Rotate32(f, 19);
f = f * 5 + 0xe6546b64;
size_t iters = (len - 1) / 20;
do {
uint32 a0 = Rotate32(Fetch32(s) * c1, 17) * c2;
uint32 a1 = Fetch32(s + 4);
uint32 a2 = Rotate32(Fetch32(s + 8) * c1, 17) * c2;
uint32 a3 = Rotate32(Fetch32(s + 12) * c1, 17) * c2;
uint32 a4 = Fetch32(s + 16);
h ^= a0;
h = Rotate32(h, 18);
h = h * 5 + 0xe6546b64;
f += a1;
f = Rotate32(f, 19);
f = f * c1;
g += a2;
g = Rotate32(g, 18);
g = g * 5 + 0xe6546b64;
h ^= a3 + a1;
h = Rotate32(h, 19);
h = h * 5 + 0xe6546b64;
g ^= a4;
g = bswap_32(g) * 5;
h += a4 * 5;
h = bswap_32(h);
f += a0;
PERMUTE3(f, h, g);
s += 20;
} while (--iters != 0);
g = Rotate32(g, 11) * c1;
g = Rotate32(g, 17) * c1;
f = Rotate32(f, 11) * c1;
f = Rotate32(f, 17) * c1;
h = Rotate32(h + g, 19);
h = h * 5 + 0xe6546b64;
h = Rotate32(h, 17) * c1;
h = Rotate32(h + f, 19);
h = h * 5 + 0xe6546b64;
h = Rotate32(h, 17) * c1;
return h;
}
// Bitwise right rotate. Normally this will compile to a single
// instruction, especially if the shift is a manifest constant.
static uint64 Rotate(uint64 val, int shift) {
// Avoid shifting by 64: doing so yields an undefined result.
return shift == 0 ? val : ((val >> shift) | (val << (64 - shift)));
}
static uint64 ShiftMix(uint64 val) {
return val ^ (val >> 47);
}
static uint64 HashLen16(uint64 u, uint64 v) {
return Hash128to64(uint128(u, v));
}
static uint64 HashLen16(uint64 u, uint64 v, uint64 mul) {
// Murmur-inspired hashing.
uint64 a = (u ^ v) * mul;
a ^= (a >> 47);
uint64 b = (v ^ a) * mul;
b ^= (b >> 47);
b *= mul;
return b;
}
static uint64 HashLen0to16(const char *s, size_t len) {
if (len >= 8) {
uint64 mul = k2 + len * 2;
uint64 a = Fetch64(s) + k2;
uint64 b = Fetch64(s + len - 8);
uint64 c = Rotate(b, 37) * mul + a;
uint64 d = (Rotate(a, 25) + b) * mul;
return HashLen16(c, d, mul);
}
if (len >= 4) {
uint64 mul = k2 + len * 2;
uint64 a = Fetch32(s);
return HashLen16(len + (a << 3), Fetch32(s + len - 4), mul);
}
if (len > 0) {
uint8 a = s[0];
uint8 b = s[len >> 1];
uint8 c = s[len - 1];
uint32 y = static_cast<uint32>(a) + (static_cast<uint32>(b) << 8);
uint32 z = len + (static_cast<uint32>(c) << 2);
return ShiftMix(y * k2 ^ z * k0) * k2;
}
return k2;
}
// This probably works well for 16-byte strings as well, but it may be overkill
// in that case.
static uint64 HashLen17to32(const char *s, size_t len) {
uint64 mul = k2 + len * 2;
uint64 a = Fetch64(s) * k1;
uint64 b = Fetch64(s + 8);
uint64 c = Fetch64(s + len - 8) * mul;
uint64 d = Fetch64(s + len - 16) * k2;
return HashLen16(Rotate(a + b, 43) + Rotate(c, 30) + d,
a + Rotate(b + k2, 18) + c, mul);
}
// Return a 16-byte hash for 48 bytes. Quick and dirty.
// Callers do best to use "random-looking" values for a and b.
static pair<uint64, uint64> WeakHashLen32WithSeeds(
uint64 w, uint64 x, uint64 y, uint64 z, uint64 a, uint64 b) {
a += w;
b = Rotate(b + a + z, 21);
uint64 c = a;
a += x;
a += y;
b += Rotate(a, 44);
return make_pair(a + z, b + c);
}
// Return a 16-byte hash for s[0] ... s[31], a, and b. Quick and dirty.
static pair<uint64, uint64> WeakHashLen32WithSeeds(
const char* s, uint64 a, uint64 b) {
return WeakHashLen32WithSeeds(Fetch64(s),
Fetch64(s + 8),
Fetch64(s + 16),
Fetch64(s + 24),
a,
b);
}
// Return an 8-byte hash for 33 to 64 bytes.
static uint64 HashLen33to64(const char *s, size_t len) {
uint64 mul = k2 + len * 2;
uint64 a = Fetch64(s) * k2;
uint64 b = Fetch64(s + 8);
uint64 c = Fetch64(s + len - 24);
uint64 d = Fetch64(s + len - 32);
uint64 e = Fetch64(s + 16) * k2;
uint64 f = Fetch64(s + 24) * 9;
uint64 g = Fetch64(s + len - 8);
uint64 h = Fetch64(s + len - 16) * mul;
uint64 u = Rotate(a + g, 43) + (Rotate(b, 30) + c) * 9;
uint64 v = ((a + g) ^ d) + f + 1;
uint64 w = bswap_64((u + v) * mul) + h;
uint64 x = Rotate(e + f, 42) + c;
uint64 y = (bswap_64((v + w) * mul) + g) * mul;
uint64 z = e + f + c;
a = bswap_64((x + z) * mul + y) + b;
b = ShiftMix((z + a) * mul + d + h) * mul;
return b + x;
}
uint64 CityHash64(const char *s, size_t len) {
if (len <= 32) {
if (len <= 16) {
return HashLen0to16(s, len);
} else {
return HashLen17to32(s, len);
}
} else if (len <= 64) {
return HashLen33to64(s, len);
}
// For strings over 64 bytes we hash the end first, and then as we
// loop we keep 56 bytes of state: v, w, x, y, and z.
uint64 x = Fetch64(s + len - 40);
uint64 y = Fetch64(s + len - 16) + Fetch64(s + len - 56);
uint64 z = HashLen16(Fetch64(s + len - 48) + len, Fetch64(s + len - 24));
pair<uint64, uint64> v = WeakHashLen32WithSeeds(s + len - 64, len, z);
pair<uint64, uint64> w = WeakHashLen32WithSeeds(s + len - 32, y + k1, x);
x = x * k1 + Fetch64(s);
// Decrease len to the nearest multiple of 64, and operate on 64-byte chunks.
len = (len - 1) & ~static_cast<size_t>(63);
do {
x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1;
y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1;
x ^= w.second;
y += v.first + Fetch64(s + 40);
z = Rotate(z + w.first, 33) * k1;
v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first);
w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16));
std::swap(z, x);
s += 64;
len -= 64;
} while (len != 0);
return HashLen16(HashLen16(v.first, w.first) + ShiftMix(y) * k1 + z,
HashLen16(v.second, w.second) + x);
}
uint64 CityHash64WithSeed(const char *s, size_t len, uint64 seed) {
return CityHash64WithSeeds(s, len, k2, seed);
}
uint64 CityHash64WithSeeds(const char *s, size_t len,
uint64 seed0, uint64 seed1) {
return HashLen16(CityHash64(s, len) - seed0, seed1);
}
// A subroutine for CityHash128(). Returns a decent 128-bit hash for strings
// of any length representable in signed long. Based on City and Murmur.
static uint128 CityMurmur(const char *s, size_t len, uint128 seed) {
uint64 a = Uint128Low64(seed);
uint64 b = Uint128High64(seed);
uint64 c = 0;
uint64 d = 0;
signed long l = len - 16;
if (l <= 0) { // len <= 16
a = ShiftMix(a * k1) * k1;
c = b * k1 + HashLen0to16(s, len);
d = ShiftMix(a + (len >= 8 ? Fetch64(s) : c));
} else { // len > 16
c = HashLen16(Fetch64(s + len - 8) + k1, a);
d = HashLen16(b + len, c + Fetch64(s + len - 16));
a += d;
do {
a ^= ShiftMix(Fetch64(s) * k1) * k1;
a *= k1;
b ^= a;
c ^= ShiftMix(Fetch64(s + 8) * k1) * k1;
c *= k1;
d ^= c;
s += 16;
l -= 16;
} while (l > 0);
}
a = HashLen16(a, c);
b = HashLen16(d, b);
return uint128(a ^ b, HashLen16(b, a));
}
uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed) {
if (len < 128) {
return CityMurmur(s, len, seed);
}
// We expect len >= 128 to be the common case. Keep 56 bytes of state:
// v, w, x, y, and z.
pair<uint64, uint64> v, w;
uint64 x = Uint128Low64(seed);
uint64 y = Uint128High64(seed);
uint64 z = len * k1;
v.first = Rotate(y ^ k1, 49) * k1 + Fetch64(s);
v.second = Rotate(v.first, 42) * k1 + Fetch64(s + 8);
w.first = Rotate(y + z, 35) * k1 + x;
w.second = Rotate(x + Fetch64(s + 88), 53) * k1;
// This is the same inner loop as CityHash64(), manually unrolled.
do {
x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1;
y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1;
x ^= w.second;
y += v.first + Fetch64(s + 40);
z = Rotate(z + w.first, 33) * k1;
v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first);
w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16));
std::swap(z, x);
s += 64;
x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1;
y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1;
x ^= w.second;
y += v.first + Fetch64(s + 40);
z = Rotate(z + w.first, 33) * k1;
v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first);
w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16));
std::swap(z, x);
s += 64;
len -= 128;
} while (LIKELY(len >= 128));
x += Rotate(v.first + z, 49) * k0;
y = y * k0 + Rotate(w.second, 37);
z = z * k0 + Rotate(w.first, 27);
w.first *= 9;
v.first *= k0;
// If 0 < len < 128, hash up to 4 chunks of 32 bytes each from the end of s.
for (size_t tail_done = 0; tail_done < len; ) {
tail_done += 32;
y = Rotate(x + y, 42) * k0 + v.second;
w.first += Fetch64(s + len - tail_done + 16);
x = x * k0 + w.first;
z += w.second + Fetch64(s + len - tail_done);
w.second += v.first;
v = WeakHashLen32WithSeeds(s + len - tail_done, v.first + z, v.second);
v.first *= k0;
}
// At this point our 56 bytes of state should contain more than
// enough information for a strong 128-bit hash. We use two
// different 56-byte-to-8-byte hashes to get a 16-byte final result.
x = HashLen16(x, v.first);
y = HashLen16(y + z, w.first);
return uint128(HashLen16(x + v.second, w.second) + y,
HashLen16(x + w.second, y + v.second));
}
uint128 CityHash128(const char *s, size_t len) {
return len >= 16 ?
CityHash128WithSeed(s + 16, len - 16,
uint128(Fetch64(s), Fetch64(s + 8) + k0)) :
CityHash128WithSeed(s, len, uint128(k0, k1));
}
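
The two low-level helpers that the hash functions above lean on, the memcpy-based unaligned load and the guarded right rotate, can be tried in isolation. The following is a minimal sketch, not part of the commit: the names `load64` and `rotr64` are illustrative stand-ins for `UNALIGNED_LOAD64` and `Rotate`, and `uint64_t` replaces the `uint64` typedef from `crucible/city.h`.

```cpp
#include <cstdint>
#include <cstring>

// Read 8 bytes from an arbitrarily aligned address. Going through memcpy
// avoids dereferencing a misaligned uint64_t pointer, which is undefined
// behaviour; compilers typically lower this to a single load instruction.
// (load64 is an illustrative name, mirroring UNALIGNED_LOAD64 above.)
static uint64_t load64(const char *p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

// Bitwise right rotate of a 64-bit value. shift == 0 is special-cased
// because the expression would otherwise shift left by 64, which is
// undefined for a 64-bit type. (rotr64 mirrors Rotate above.)
static uint64_t rotr64(uint64_t val, int shift) {
    return shift == 0 ? val : ((val >> shift) | (val << (64 - shift)));
}
```

The sketch assumes `shift` stays in the range 0..63, as it does for every call site in the file above.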

17
lib/cleanup.cc Normal file

@@ -0,0 +1,17 @@
#include "crucible/cleanup.h"
namespace crucible {
Cleanup::Cleanup(function<void()> func) :
m_cleaner(func)
{
}
Cleanup::~Cleanup()
{
if (m_cleaner) {
m_cleaner();
}
}
}
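
The class above is a small scope guard: the callable passed to the constructor runs when the object is destroyed. A minimal sketch of how such a guard behaves follows. The `Cleanup` class here is a self-contained stand-in written for this example (the real declaration lives in `crucible/cleanup.h`, which is not shown); the deleted copy operations and the `run_with_cleanup` helper are assumptions added for illustration.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for crucible::Cleanup: run the stored callable, if any,
// from the destructor.
class Cleanup {
    std::function<void()> m_cleaner;
public:
    explicit Cleanup(std::function<void()> func) : m_cleaner(std::move(func)) {}
    ~Cleanup() {
        if (m_cleaner) {
            m_cleaner();
        }
    }
    // Assumption for this sketch: forbid copies so the callable runs once.
    Cleanup(const Cleanup &) = delete;
    Cleanup &operator=(const Cleanup &) = delete;
};

// Hypothetical helper that records the order in which things happen.
std::vector<std::string> run_with_cleanup() {
    std::vector<std::string> events;
    {
        Cleanup guard([&]() { events.push_back("cleanup"); });
        events.push_back("work");
    }   // guard is destroyed here, so the lambda runs before "after"
    events.push_back("after");
    return events;
}
```

The callable also runs when the scope is left by an exception, which is the usual reason to prefer a guard object over an explicit call at the end of the block.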

6
lib/configure.h.in Normal file

@@ -0,0 +1,6 @@
#ifndef _CONFIGURE_H
#define ETC_PREFIX "@ETC_PREFIX@"
#define _CONFIGURE_H
#endif

View File

@@ -1,3 +1,31 @@
/* crc64.c -- compute CRC-64
* Copyright (C) 2013 Mark Adler
* Version 1.4 16 Dec 2013 Mark Adler
*/
/*
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler@alumni.caltech.edu
*/
/* Substantially modified by Paul Jones for usage in bees */
#include "crucible/crc64.h"
#define POLY64REV 0xd800000000000000ULL
@@ -5,13 +33,16 @@
namespace crucible {
static bool init = false;
static uint64_t CRCTable[256];
static uint64_t CRCTable[8][256];
static void init_crc64_table()
{
if (!init) {
for (int i = 0; i <= 255; i++) {
uint64_t part = i;
uint64_t crc;
// Generate CRCs for all single byte sequences
for (int n = 0; n < 256; n++) {
uint64_t part = n;
for (int j = 0; j < 8; j++) {
if (part & 1) {
part = (part >> 1) ^ POLY64REV;
@@ -19,37 +50,53 @@ namespace crucible {
part >>= 1;
}
}
CRCTable[i] = part;
CRCTable[0][n] = part;
}
// Generate nested CRC table for slice-by-8 lookup
for (int n = 0; n < 256; n++) {
crc = CRCTable[0][n];
for (int k = 1; k < 8; k++) {
crc = CRCTable[0][crc & 0xff] ^ (crc >> 8);
CRCTable[k][n] = crc;
}
}
init = true;
}
}
uint64_t
Digest::CRC::crc64(const char *s)
{
init_crc64_table();
uint64_t crc = 0;
for (; *s; s++) {
uint64_t temp1 = crc >> 8;
uint64_t temp2 = CRCTable[(crc ^ static_cast<uint64_t>(*s)) & 0xff];
crc = temp1 ^ temp2;
}
return crc;
}
uint64_t
Digest::CRC::crc64(const void *p, size_t len)
{
init_crc64_table();
const unsigned char *next = static_cast<const unsigned char *>(p);
uint64_t crc = 0;
for (const unsigned char *s = static_cast<const unsigned char *>(p); len; --len) {
uint64_t temp1 = crc >> 8;
uint64_t temp2 = CRCTable[(crc ^ *s++) & 0xff];
crc = temp1 ^ temp2;
// Process individual bytes until we reach an 8-byte aligned pointer
while (len && (reinterpret_cast<uintptr_t>(next) & 7) != 0) {
crc = CRCTable[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
len--;
}
// Fast middle processing, 8 bytes (aligned!) per loop
while (len >= 8) {
crc ^= *(reinterpret_cast< const uint64_t *>(next));
crc = CRCTable[7][crc & 0xff] ^
CRCTable[6][(crc >> 8) & 0xff] ^
CRCTable[5][(crc >> 16) & 0xff] ^
CRCTable[4][(crc >> 24) & 0xff] ^
CRCTable[3][(crc >> 32) & 0xff] ^
CRCTable[2][(crc >> 40) & 0xff] ^
CRCTable[1][(crc >> 48) & 0xff] ^
CRCTable[0][crc >> 56];
next += 8;
len -= 8;
}
// Process remaining bytes (can't be larger than 8)
while (len) {
crc = CRCTable[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
len--;
}
return crc;
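
The slice-by-8 change above can be checked against the byte-at-a-time loop it replaces. The sketch below is a self-contained approximation, not the code from the commit: it reuses the same polynomial and table construction, but the function names (`crc64_bytewise`, `crc64_slice8`, `ensure_table`) are made up for this example, and the 8-byte word is assembled byte by byte in little-endian order instead of being read through an aligned `uint64_t` pointer, so no alignment prologue or host byte order assumption is needed.

```cpp
#include <cstddef>
#include <cstdint>

static const uint64_t POLY64REV = 0xd800000000000000ULL;
static uint64_t crc_table[8][256];

// Build the tables once. crc_table[0] holds the CRC of every single byte;
// crc_table[k][n] is the CRC of byte n followed by k zero bytes.
static void ensure_table() {
    static bool init = false;
    if (init) return;
    for (int n = 0; n < 256; n++) {
        uint64_t part = n;
        for (int j = 0; j < 8; j++) {
            part = (part & 1) ? (part >> 1) ^ POLY64REV : part >> 1;
        }
        crc_table[0][n] = part;
    }
    for (int n = 0; n < 256; n++) {
        uint64_t crc = crc_table[0][n];
        for (int k = 1; k < 8; k++) {
            crc = crc_table[0][crc & 0xff] ^ (crc >> 8);
            crc_table[k][n] = crc;
        }
    }
    init = true;
}

// Reference implementation: one table lookup per input byte.
uint64_t crc64_bytewise(const void *p, size_t len) {
    ensure_table();
    const unsigned char *next = static_cast<const unsigned char *>(p);
    uint64_t crc = 0;
    for (; len; --len) {
        crc = crc_table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
    }
    return crc;
}

// Slice-by-8: fold 8 input bytes into the CRC with 8 independent lookups.
uint64_t crc64_slice8(const void *p, size_t len) {
    ensure_table();
    const unsigned char *next = static_cast<const unsigned char *>(p);
    uint64_t crc = 0;
    while (len >= 8) {
        uint64_t word = 0;
        for (int i = 0; i < 8; i++) {
            word |= static_cast<uint64_t>(next[i]) << (8 * i);
        }
        crc ^= word;
        crc = crc_table[7][crc & 0xff] ^
              crc_table[6][(crc >> 8) & 0xff] ^
              crc_table[5][(crc >> 16) & 0xff] ^
              crc_table[4][(crc >> 24) & 0xff] ^
              crc_table[3][(crc >> 32) & 0xff] ^
              crc_table[2][(crc >> 40) & 0xff] ^
              crc_table[1][(crc >> 48) & 0xff] ^
              crc_table[0][crc >> 56];
        next += 8;
        len -= 8;
    }
    // Fewer than 8 bytes remain; finish one byte at a time.
    for (; len; --len) {
        crc = crc_table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
    }
    return crc;
}
```

The lowest byte of the word is the first input byte, which is followed by seven more bytes in the block, hence the lookup in table 7; the highest byte is the last one and uses table 0. Both functions should return the same value for any input, which makes the byte-wise version a convenient oracle when testing the faster one.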

View File

@@ -32,7 +32,7 @@ namespace crucible {
// FIXME: could probably avoid some of these levels of indirection
static
function<void(string s)> current_catch_explainer = [&](string s) {
function<void(string s)> current_catch_explainer = [](string s) {
cerr << s << endl;
};

View File

@@ -1,104 +0,0 @@
#include "crucible/execpipe.h"
#include "crucible/chatter.h"
#include "crucible/error.h"
#include "crucible/process.h"
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
namespace crucible {
using namespace std;
void
redirect_stdin(const Fd &child_fd)
{
dup2_or_die(child_fd, STDIN_FILENO);
}
void
redirect_stdin_stdout(const Fd &child_fd)
{
dup2_or_die(child_fd, STDOUT_FILENO);
dup2_or_die(child_fd, STDIN_FILENO);
}
void
redirect_stdin_stdout_stderr(const Fd &child_fd)
{
dup2_or_die(child_fd, STDERR_FILENO);
dup2_or_die(child_fd, STDOUT_FILENO);
dup2_or_die(child_fd, STDIN_FILENO);
}
void
redirect_stdout_stderr(const Fd &child_fd)
{
dup2_or_die(child_fd, STDERR_FILENO);
dup2_or_die(child_fd, STDOUT_FILENO);
}
void
redirect_stdout(const Fd &child_fd)
{
dup2_or_die(child_fd, STDOUT_FILENO);
}
void
redirect_stderr(const Fd &child_fd)
{
dup2_or_die(child_fd, STDERR_FILENO);
}
Fd popen(function<int()> f, function<void(const Fd &child_fd)> import_fd_fn)
{
Fd parent_fd, child_fd;
{
pair<Fd, Fd> fd_pair = socketpair_or_die();
parent_fd = fd_pair.first;
child_fd = fd_pair.second;
}
pid_t fv;
DIE_IF_MINUS_ONE(fv = fork());
if (fv) {
child_fd->close();
return parent_fd;
} else {
int rv = EXIT_FAILURE;
catch_all([&]() {
parent_fd->close();
import_fd_fn(child_fd);
// system("ls -l /proc/$$/fd/ >&2");
rv = f();
});
_exit(rv);
cerr << "PID " << getpid() << " TID " << gettid() << "STILL ALIVE" << endl;
system("ls -l /proc/$$/task/ >&2");
exit(EXIT_FAILURE);
}
}
string
read_all(Fd fd, size_t max_bytes, size_t chunk_bytes)
{
char buf[chunk_bytes];
string str;
size_t rv;
while (1) {
read_partial_or_die(fd, static_cast<void *>(buf), chunk_bytes, rv);
if (rv == 0) {
break;
}
if (max_bytes - str.size() < rv) {
THROW_ERROR(out_of_range, "Output size limit " << max_bytes << " exceeded by appending " << rv << " bytes read to " << str.size() << " already in string");
}
str.append(buf, rv);
}
return str;
}
}

View File

@@ -6,26 +6,40 @@
#include "crucible/limits.h"
#include "crucible/string.h"
namespace crucible {
using namespace std;
const off_t ExtentWalker::sc_step_size;
// fm_start, fm_length, fm_flags, m_extents
// fe_logical, fe_physical, fe_length, fe_flags
static const off_t MAX_OFFSET = numeric_limits<off_t>::max();
static const off_t FIEMAP_BLOCK_SIZE = 4096;
static bool __ew_do_log = getenv("EXTENTWALKER_DEBUG");
// Maximum number of extents from TREE_SEARCH.
static const unsigned sc_extent_fetch_max = 16;
// Minimum number of extents from TREE_SEARCH.
// If we don't get this number, we'll binary search backward
// until we reach the beginning of the file or find at least this
// number of extents.
static const unsigned sc_extent_fetch_min = 4;
// This is a guess that tries to land at least one extent
// before the target extent, so we don't have to search backward as often.
static const off_t sc_back_step_size = 64 * 1024;
#ifdef EXTENTWALKER_DEBUG
#define EWLOG(x) do { \
if (__ew_do_log) { \
CHATTER(x); \
} \
m_log << x << endl; \
} while (0)
#define EWTRACE(x) do { \
CHATTER_UNWIND(x); \
} while (0)
#else
#define EWLOG(x) do {} while (0)
#define EWTRACE(x) do {} while (0)
#endif
ostream &
operator<<(ostream &os, const Extent &e)
{
@@ -43,9 +57,7 @@ namespace crucible {
if (e.m_flags & Extent::OBSCURED) {
os << "Extent::OBSCURED|";
}
if (e.m_flags & ~(Extent::HOLE | Extent::PREALLOC | Extent::OBSCURED)) {
os << fiemap_extent_flags_ntoa(e.m_flags & ~(Extent::HOLE | Extent::PREALLOC | Extent::OBSCURED));
}
os << fiemap_extent_flags_ntoa(e.m_flags & ~(Extent::HOLE | Extent::PREALLOC | Extent::OBSCURED));
if (e.m_physical_len) {
os << ", physical_len = " << to_hex(e.m_physical_len);
}
@@ -71,23 +83,16 @@ namespace crucible {
ostream &
operator<<(ostream &os, const ExtentWalker &ew)
{
return os << "ExtentWalker {"
os << "ExtentWalker {"
<< " fd = " << name_fd(ew.m_fd)
<< ", stat.st_size = " << to_hex(ew.m_stat.st_size)
<< ", extents = " << ew.m_extents
<< ", current = [" << ew.m_current - ew.m_extents.begin()
<< "] }";
}
Extent::Extent() :
m_begin(0),
m_end(0),
m_physical(0),
m_flags(0),
m_physical_len(0),
m_logical_len(0),
m_offset(0)
{
<< "] ";
#ifdef EXTENTWALKER_DEBUG
os << "\nLog:\n" << ew.m_log.str() << "\nEnd log";
#endif
return os << "}";
}
Extent::operator bool() const
@@ -109,6 +114,18 @@ namespace crucible {
return m_begin == that.m_begin && m_end == that.m_end && m_physical == that.m_physical && m_flags == that.m_flags;
}
bool
Extent::compressed() const
{
return m_flags & FIEMAP_EXTENT_ENCODED;
}
uint64_t
Extent::bytenr() const
{
return compressed() ? m_physical : m_physical - m_offset;
}
ExtentWalker::ExtentWalker(Fd fd) :
m_fd(fd),
m_current(m_extents.begin())
@@ -161,8 +178,7 @@ namespace crucible {
void
ExtentWalker::run_fiemap(off_t pos)
{
ostringstream log;
CHATTER_UNWIND("Log of run_fiemap: " << log.str());
EWTRACE("Log of run_fiemap: " << m_log.str());
EWLOG("pos = " << to_hex(pos));
@@ -170,18 +186,24 @@ namespace crucible {
Vec fm;
off_t step_size = pos;
off_t begin = pos - min(pos, sc_step_size);
// Start backward search by dropping lowest bit
off_t step_size = (pos > 0) ? (pos ^ (pos & (pos - 1))) * 2 : 0;
// Start first pass through loop just a little before the target extent,
// because the first iteration will be wasted if we have an exact match.
off_t begin = pos - min(pos, sc_back_step_size);
// This loop should not run forever
int loop_count = 0;
int loop_limit = 99;
const int loop_limit = 99;
while (true) {
if (loop_count == 90) {
EWLOG(log.str());
#ifdef EXTENTWALKER_DEBUG
if (loop_count >= loop_limit) {
cerr << "Too many loops!" << endl << m_log.str() << endl;
abort();
}
THROW_CHECK1(runtime_error, loop_count, loop_count < loop_limit);
#endif
THROW_CHECK2(runtime_error, *this, loop_count, loop_count < loop_limit);
++loop_count;
// Get file size every time in case it changes under us
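
The new `step_size` expression in this hunk isolates the lowest set bit of an offset: `x & (x - 1)` clears that bit, and XOR with `x` leaves only it. Subtracting it repeatedly walks `begin` toward zero in at most one step per set bit. The sketch below illustrates that arithmetic in isolation; `lowest_set_bit` and `backward_steps` are hypothetical names, `int64_t` stands in for `off_t`, and the 4096-byte block size is taken from `FIEMAP_BLOCK_SIZE` above. It does not call FIEMAP or reproduce the rest of the loop.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Isolate the lowest set bit of x, or return 0 when x is not positive.
static int64_t lowest_set_bit(int64_t x) {
    return x > 0 ? (x ^ (x & (x - 1))) : 0;
}

// Hypothetical helper: list the successive values of `begin` produced by
// stepping backward, each one rounded down to a block boundary.
std::vector<int64_t> backward_steps(int64_t begin) {
    const int64_t block = 4096;
    std::vector<int64_t> visited;
    while (begin > 0) {
        const int64_t step = lowest_set_bit(begin);
        const int64_t next = (begin - std::min(step, begin)) & ~(block - 1);
        if (next == begin) {
            break;  // mirrors the "step backward stopped" guard
        }
        begin = next;
        visited.push_back(begin);
    }
    return visited;
}
```

For example, starting from `0x13000` the sketch visits `0x12000`, `0x10000`, and then `0`, so the backward search finishes in a number of steps bounded by the width of the offset, well under the `loop_limit` of 99.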
@@ -189,7 +211,16 @@ namespace crucible {
// Get fiemap begin..EOF
fm = get_extent_map(begin);
EWLOG("fiemap result loop count #" << loop_count << ":" << fm);
EWLOG("fiemap result loop count #" << loop_count << " begin " << to_hex(begin) << " pos "
<< to_hex(pos) << " step_size " << to_hex(step_size) << ":\n" << fm);
// Sanity check on the data: in order, not overlapping, not empty, not before pos
off_t sanity_pos = begin;
for (auto const &i : fm) {
THROW_CHECK1(runtime_error, fm, i.begin() >= sanity_pos);
THROW_CHECK1(runtime_error, fm, i.end() > i.begin());
sanity_pos = i.end();
}
// This algorithm seeks at least three extents: one before,
// one after, and one containing pos. Files which contain
@@ -197,15 +228,15 @@ namespace crucible {
// so handle those cases separately.
// FIEMAP lies, and we catch it in a lie about the size of the
// second extent. To work around this, try getting more than 3.
// second extent. To work around this, sc_extent_fetch_min is at least 4.
// 0..2(ish) extents
if (fm.size() < sc_extent_fetch_min) {
// If we are not at beginning of file, move backward
// If we are not at beginning of file, move backward by zeroing the lowest bit
if (begin > 0) {
step_size /= 2;
step_size = (begin > 0) ? (begin ^ (begin & (begin - 1))) : 0;
auto next_begin = (begin - min(step_size, begin)) & ~(FIEMAP_BLOCK_SIZE - 1);
EWLOG("step backward " << to_hex(begin) << " -> " << to_hex(next_begin) << " extents size " << fm.size());
EWLOG("step backward " << to_hex(begin) << " -> " << to_hex(next_begin));
if (begin == next_begin) {
EWLOG("step backward stopped");
break;
@@ -233,18 +264,18 @@ namespace crucible {
// We have at least three extents, so there is now a first and last.
// We want pos to be between first and last. There doesn't have
// to be an extent between these (it could be a hole).
auto &first_extent = fm.at(sc_extent_fetch_min - 2);
auto &first_extent = *fm.begin();
auto &last_extent = *fm.rbegin();
EWLOG("first_extent = " << first_extent);
EWLOG("last_extent = " << last_extent);
// First extent must end on or before pos
// First extent must end on or before pos; otherwise, go further back
if (first_extent.end() > pos) {
// Can we move backward?
if (begin > 0) {
step_size /= 2;
step_size = (begin > 0) ? (begin ^ (begin & (begin - 1))) : 0;
auto next_begin = (begin - min(step_size, begin)) & ~(FIEMAP_BLOCK_SIZE - 1);
EWLOG("step backward " << to_hex(begin) << " -> " << to_hex(next_begin) << " extents size " << fm.size());
EWLOG("step backward " << to_hex(begin) << " -> " << to_hex(next_begin));
if (begin == next_begin) {
EWLOG("step backward stopped");
break;
@@ -254,38 +285,29 @@ namespace crucible {
}
// We are as far back as we can go, so there must be no
// extent before pos (i.e. file starts with a hole).
// extent before pos (i.e. file starts with a hole
// or first extent starts at pos 0).
EWLOG("no extent before pos");
break;
}
// First extent ends on or before pos.
// If last extent is EOF then we have the entire file in the buffer.
// If last extent is EOF then we cannot move any further forward.
// pos could be in last extent, so skip the later checks that
// insist pos be located prior to the last extent.
if (last_extent.flags() & FIEMAP_EXTENT_LAST) {
break;
}
// Don't have EOF, must have an extent after pos.
// Don't have EOF, must have an extent after pos; otherwise, go forward
if (last_extent.begin() <= pos) {
// Set the bit just below the one we last cleared
step_size /= 2;
auto new_begin = (begin + step_size) & ~(FIEMAP_BLOCK_SIZE - 1);
auto new_begin = (begin + max(FIEMAP_BLOCK_SIZE, step_size)) & ~(FIEMAP_BLOCK_SIZE - 1);
EWLOG("step forward " << to_hex(begin) << " -> " << to_hex(new_begin));
if (begin == new_begin) {
EWLOG("step forward stopped");
break;
}
begin = new_begin;
continue;
}
// Last extent begins after pos, first extent ends on or before pos.
// All other cases should have been handled before here.
THROW_CHECK2(runtime_error, pos, first_extent, first_extent.end() <= pos);
THROW_CHECK2(runtime_error, pos, last_extent, last_extent.begin() > pos);
// We should probably stop now
break;
}
@@ -300,6 +322,11 @@ namespace crucible {
while (fmi != fm.end()) {
Extent new_extent(*fmi);
THROW_CHECK2(runtime_error, ipos, new_extent.m_begin, ipos <= new_extent.m_begin);
// Don't map extents past EOF, we can't read them
if (new_extent.m_begin >= m_stat.st_size) {
last_extent_is_last = true;
break;
}
if (new_extent.m_begin > ipos) {
Extent hole_extent;
hole_extent.m_begin = ipos;
@@ -327,12 +354,14 @@ namespace crucible {
hole_extent.m_flags |= FIEMAP_EXTENT_LAST;
}
new_vec.push_back(hole_extent);
ipos += new_vec.size();
ipos += hole_extent.size();
}
// Extent list must now be non-empty, at least a hole
THROW_CHECK1(runtime_error, new_vec.size(), !new_vec.empty());
// Allow last extent to extend beyond desired range (e.g. at EOF)
THROW_CHECK2(runtime_error, ipos, new_vec.rbegin()->m_end, ipos <= new_vec.rbegin()->m_end);
// ipos must match end of last extent
THROW_CHECK3(runtime_error, ipos, new_vec.rbegin()->m_end, m_stat.st_size, ipos == new_vec.rbegin()->m_end);
// If we have the last extent in the file, truncate it to the file size.
if (ipos >= m_stat.st_size) {
THROW_CHECK2(runtime_error, new_vec.rbegin()->m_begin, m_stat.st_size, m_stat.st_size > new_vec.rbegin()->m_begin);
@@ -340,9 +369,10 @@ namespace crucible {
new_vec.rbegin()->m_end = m_stat.st_size;
}
// Verify contiguous, ascending order, at least one Extent
// Verify at least one Extent
THROW_CHECK1(runtime_error, new_vec, !new_vec.empty());
// Verify contiguous, ascending order, only extent with FIEMAP_EXTENT_LAST flag is the last extent
ipos = new_vec.begin()->m_begin;
bool last_flag_last = false;
for (auto e : new_vec) {
@@ -352,7 +382,6 @@ namespace crucible {
ipos += e.size();
last_flag_last = e.m_flags & FIEMAP_EXTENT_LAST;
}
THROW_CHECK1(runtime_error, new_vec, !last_extent_is_last || new_vec.rbegin()->m_end == ipos);
m_extents = new_vec;
m_current = m_extents.begin();
@@ -368,7 +397,7 @@ namespace crucible {
void
ExtentWalker::seek(off_t pos)
{
CHATTER_UNWIND("seek " << to_hex(pos));
EWTRACE("seek " << to_hex(pos));
THROW_CHECK1(out_of_range, pos, pos >= 0);
Itr rv = find_in_cache(pos);
if (rv != m_extents.end()) {
@@ -377,29 +406,28 @@ namespace crucible {
}
run_fiemap(pos);
m_current = find_in_cache(pos);
THROW_CHECK2(runtime_error, *this, to_hex(pos), m_current != m_extents.end());
}
Extent
ExtentWalker::current()
{
THROW_CHECK2(invalid_argument, *this, m_extents.size(), m_current != m_extents.end());
CHATTER_UNWIND("current " << *m_current);
return *m_current;
}
bool
ExtentWalker::next()
{
CHATTER_UNWIND("next");
EWTRACE("next");
THROW_CHECK1(invalid_argument, (m_current != m_extents.end()), m_current != m_extents.end());
if (current().m_end >= m_stat.st_size) {
CHATTER_UNWIND("next EOF");
EWTRACE("next EOF");
return false;
}
auto next_pos = current().m_end;
if (next_pos >= m_stat.st_size) {
CHATTER_UNWIND("next next_pos = " << next_pos << " m_stat.st_size = " << m_stat.st_size);
EWTRACE("next next_pos = " << next_pos << " m_stat.st_size = " << m_stat.st_size);
return false;
}
seek(next_pos);
@@ -417,16 +445,16 @@ namespace crucible {
bool
ExtentWalker::prev()
{
CHATTER_UNWIND("prev");
EWTRACE("prev");
THROW_CHECK1(invalid_argument, (m_current != m_extents.end()), m_current != m_extents.end());
auto prev_iter = m_current;
if (prev_iter->m_begin == 0) {
CHATTER_UNWIND("prev BOF");
EWTRACE("prev BOF");
return false;
}
THROW_CHECK1(invalid_argument, (prev_iter != m_extents.begin()), prev_iter != m_extents.begin());
--prev_iter;
CHATTER_UNWIND("prev seeking to " << *prev_iter << "->m_begin");
EWTRACE("prev seeking to " << *prev_iter << "->m_begin");
auto prev_end = current().m_begin;
seek(prev_iter->m_begin);
THROW_CHECK1(runtime_error, (m_current != m_extents.end()), m_current != m_extents.end());
@@ -485,7 +513,7 @@ namespace crucible {
sk.min_type = sk.max_type = BTRFS_EXTENT_DATA_KEY;
sk.nr_items = sc_extent_fetch_max;
CHATTER_UNWIND("sk " << sk << " root_fd " << name_fd(m_root_fd));
EWTRACE("sk " << sk << " root_fd " << name_fd(m_root_fd));
sk.do_ioctl(m_root_fd);
Vec rv;
@@ -511,37 +539,38 @@ namespace crucible {
Extent e;
e.m_begin = i.offset;
auto compressed = call_btrfs_get(btrfs_stack_file_extent_compression, i.m_data);
auto compressed = btrfs_get_member(&btrfs_file_extent_item::compression, i.m_data);
// FIEMAP told us about compressed extents and we can too
if (compressed) {
e.m_flags |= FIEMAP_EXTENT_ENCODED;
}
auto type = call_btrfs_get(btrfs_stack_file_extent_type, i.m_data);
auto type = btrfs_get_member(&btrfs_file_extent_item::type, i.m_data);
off_t len = -1;
switch (type) {
default:
switch (type) {
default:
cerr << "Unhandled file extent type " << type << " in root " << m_tree_id << " ino " << m_stat.st_ino << endl;
break;
case BTRFS_FILE_EXTENT_INLINE:
len = ranged_cast<off_t>(call_btrfs_get(btrfs_stack_file_extent_ram_bytes, i.m_data));
case BTRFS_FILE_EXTENT_INLINE:
len = ranged_cast<off_t>(btrfs_get_member(&btrfs_file_extent_item::ram_bytes, i.m_data));
e.m_flags |= FIEMAP_EXTENT_DATA_INLINE | FIEMAP_EXTENT_NOT_ALIGNED;
// Inline extents are never obscured, so don't bother filling in m_physical_len, etc.
break;
case BTRFS_FILE_EXTENT_PREALLOC:
break;
case BTRFS_FILE_EXTENT_PREALLOC:
e.m_flags |= Extent::PREALLOC;
case BTRFS_FILE_EXTENT_REG: {
e.m_physical = call_btrfs_get(btrfs_stack_file_extent_disk_bytenr, i.m_data);
// fallthrough
case BTRFS_FILE_EXTENT_REG: {
e.m_physical = btrfs_get_member(&btrfs_file_extent_item::disk_bytenr, i.m_data);
// This is the length of the full extent (decompressed)
off_t ram = ranged_cast<off_t>(call_btrfs_get(btrfs_stack_file_extent_ram_bytes, i.m_data));
off_t ram = ranged_cast<off_t>(btrfs_get_member(&btrfs_file_extent_item::ram_bytes, i.m_data));
// This is the length of the part of the extent appearing in the file (decompressed)
len = ranged_cast<off_t>(call_btrfs_get(btrfs_stack_file_extent_num_bytes, i.m_data));
len = ranged_cast<off_t>(btrfs_get_member(&btrfs_file_extent_item::num_bytes, i.m_data));
// This is the offset from start of on-disk extent to the part we see in the file (decompressed)
// May be negative due to the kind of bug we're stuck with forever, so no cast range check
off_t offset = call_btrfs_get(btrfs_stack_file_extent_offset, i.m_data);
off_t offset = btrfs_get_member(&btrfs_file_extent_item::offset, i.m_data);
// If there is a physical address there must be size too
if (e.m_physical) {
@@ -594,7 +623,7 @@ namespace crucible {
e.m_flags |= FIEMAP_EXTENT_LAST;
}
// FIXME: no FIEMAP_EXTENT_SHARED
// WONTFIX: non-trivial to replicate LOGIAL_INO
// WONTFIX: non-trivial to replicate LOGICAL_INO
rv.push_back(e);
}
}
@@ -610,9 +639,8 @@ namespace crucible {
ExtentWalker::Vec
ExtentWalker::get_extent_map(off_t pos)
{
Fiemap fm;
fm.fm_start = ranged_cast<uint64_t>(pos);
fm.fm_length = ranged_cast<uint64_t>(numeric_limits<off_t>::max() - pos);
EWLOG("get_extent_map(" << to_hex(pos) << ")");
Fiemap fm(ranged_cast<uint64_t>(pos), ranged_cast<uint64_t>(numeric_limits<off_t>::max() - pos));
fm.m_max_count = fm.m_min_count = sc_extent_fetch_max;
fm.do_ioctl(m_fd);
Vec rv;
@@ -623,7 +651,9 @@ namespace crucible {
e.m_physical = i.fe_physical;
e.m_flags = i.fe_flags;
rv.push_back(e);
EWLOG("push_back(" << e << ")");
}
EWLOG("get_extent_map(" << to_hex(pos) << ") returning " << rv.size() << " extents");
return rv;
}

152
lib/fd.cc

@@ -107,12 +107,6 @@ namespace crucible {
}
}
IOHandle::IOHandle() :
m_fd(-1)
{
CHATTER_TRACE("open fd " << m_fd << " in " << this);
}
IOHandle::IOHandle(int fd) :
m_fd(fd)
{
@@ -120,12 +114,52 @@ namespace crucible {
}
int
IOHandle::release_fd()
IOHandle::get_fd() const
{
CHATTER_TRACE("release fd " << m_fd << " in " << this);
int rv = m_fd;
m_fd = -1;
return rv;
return m_fd;
}
NamedPtr<IOHandle, int> Fd::s_named_ptr([](int fd) { return make_shared<IOHandle>(fd); });
Fd::Fd() :
m_handle(s_named_ptr(-1))
{
}
Fd::Fd(int fd) :
m_handle(s_named_ptr(fd < 0 ? -1 : fd))
{
}
Fd &
Fd::operator=(int const fd)
{
m_handle = s_named_ptr(fd < 0 ? -1 : fd);
return *this;
}
Fd &
Fd::operator=(const shared_ptr<IOHandle> &handle)
{
m_handle = s_named_ptr.insert(handle, handle->get_fd());
return *this;
}
Fd::operator int() const
{
return m_handle->get_fd();
}
bool
Fd::operator!() const
{
return m_handle->get_fd() < 0;
}
shared_ptr<IOHandle>
Fd::operator->() const
{
return m_handle;
}
// XXX: necessary? useful?
@@ -174,11 +208,13 @@ namespace crucible {
static const struct bits_ntoa_table mmap_flags_table[] = {
NTOA_TABLE_ENTRY_BITS(MAP_SHARED),
NTOA_TABLE_ENTRY_BITS(MAP_PRIVATE),
#ifdef MAP_32BIT
NTOA_TABLE_ENTRY_BITS(MAP_32BIT),
#endif
NTOA_TABLE_ENTRY_BITS(MAP_ANONYMOUS),
NTOA_TABLE_ENTRY_BITS(MAP_DENYWRITE),
NTOA_TABLE_ENTRY_BITS(MAP_EXECUTABLE),
#if MAP_FILE
#ifdef MAP_FILE
NTOA_TABLE_ENTRY_BITS(MAP_FILE),
#endif
NTOA_TABLE_ENTRY_BITS(MAP_FIXED),
@@ -230,6 +266,14 @@ namespace crucible {
}
}
void
ftruncate_or_die(int fd, off_t size)
{
if (::ftruncate(fd, size)) {
THROW_ERRNO("ftruncate: " << name_fd(fd) << " size " << size);
}
}
string
socket_domain_ntoa(int domain)
{
@@ -317,8 +361,11 @@ namespace crucible {
THROW_ERROR(invalid_argument, "pwrite: trying to write on a closed file descriptor");
}
int rv = ::pwrite(fd, buf, size, offset);
if (rv != static_cast<int>(size)) {
THROW_ERROR(runtime_error, "pwrite: only " << rv << " of " << size << " bytes written at offset " << offset);
if (rv < 0) {
THROW_ERRNO("pwrite: could not write " << size << " bytes at fd " << name_fd(fd) << " offset " << offset);
}
if (rv != static_cast<ssize_t>(size)) {
THROW_ERROR(runtime_error, "pwrite: only " << rv << " of " << size << " bytes written at fd " << name_fd(fd) << " offset " << offset);
}
}
@@ -348,7 +395,7 @@ namespace crucible {
}
THROW_ERRNO("read: " << size << " bytes");
}
if (rv > static_cast<int>(size)) {
if (rv > static_cast<ssize_t>(size)) {
THROW_ERROR(runtime_error, "read: somehow read more bytes (" << rv << ") than requested (" << size << ")");
}
if (rv == 0) break;
@@ -397,8 +444,8 @@ namespace crucible {
}
THROW_ERRNO("pread: " << size << " bytes");
}
if (rv != static_cast<int>(size)) {
THROW_ERROR(runtime_error, "pread: " << size << " bytes at offset " << offset << " returned " << rv);
if (rv != static_cast<ssize_t>(size)) {
THROW_ERROR(runtime_error, "pread: " << size << " bytes at fd " << name_fd(fd) << " offset " << offset << " returned " << rv);
}
break;
}
@@ -414,21 +461,28 @@ namespace crucible {
template<>
void
pread_or_die<vector<char>>(int fd, vector<char> &text, off_t offset)
pread_or_die<ByteVector>(int fd, ByteVector &text, off_t offset)
{
return pread_or_die(fd, text.data(), text.size(), offset);
}
template<>
void
pread_or_die<vector<uint8_t>>(int fd, vector<uint8_t> &text, off_t offset)
pwrite_or_die<ByteVector>(int fd, const ByteVector &text, off_t offset)
{
return pread_or_die(fd, text.data(), text.size(), offset);
return pwrite_or_die(fd, text.data(), text.size(), offset);
}
Stat::Stat()
template<>
void
pwrite_or_die<string>(int fd, const string &text, off_t offset)
{
return pwrite_or_die(fd, text.data(), text.size(), offset);
}
Stat::Stat() :
stat( (stat) { } )
{
memset_zero<stat>(this);
}
Stat &
@@ -447,18 +501,39 @@ namespace crucible {
return *this;
}
Stat::Stat(int fd)
Stat::Stat(int fd) :
stat( (stat) { } )
{
memset_zero<stat>(this);
fstat(fd);
}
Stat::Stat(const string &filename)
Stat::Stat(const string &filename) :
stat( (stat) { } )
{
memset_zero<stat>(this);
lstat(filename);
}
int
ioctl_iflags_get(int fd)
{
int attr = 0;
DIE_IF_MINUS_ONE(ioctl(fd, FS_IOC_GETFLAGS, &attr));
return attr;
}
void
ioctl_iflags_set(int fd, int attr)
{
// This bit of nonsense brought to you by Valgrind.
union {
int attr;
long zero;
} u;
u.zero = 0;
u.attr = attr;
DIE_IF_MINUS_ONE(ioctl(fd, FS_IOC_SETFLAGS, &u.attr));
}
string
readlink_or_die(const string &path)
{
@@ -484,6 +559,24 @@ namespace crucible {
THROW_ERROR(runtime_error, "readlink: maximum buffer size exceeded");
}
static string __relative_path;
string
relative_path()
{
return __relative_path;
}
void
set_relative_path(string path)
{
path = path + "/";
for (string::size_type i = path.find("//"); i != string::npos; i = path.find("//")) {
path.erase(i, 1);
}
__relative_path = path;
}
// Turn a FD into a human-recognizable filename OR an error message.
string
name_fd(int fd)
@@ -491,7 +584,12 @@ namespace crucible {
try {
ostringstream oss;
oss << "/proc/self/fd/" << fd;
return readlink_or_die(oss.str());
string path = readlink_or_die(oss.str());
if (!__relative_path.empty() && 0 == path.find(__relative_path))
{
path.erase(0, __relative_path.length());
}
return path;
} catch (exception &e) {
return string(e.what());
}
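
The relative-path handling in the lib/fd.cc changes above (set_relative_path and the prefix stripping in name_fd) can be sketched in isolation. This is a minimal sketch, not the library's code: normalize_prefix and strip_prefix are hypothetical names, and they take the prefix as an argument instead of using the file-scope static that the diff uses.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of crucible): append a trailing slash,
// then collapse every run of '/' into a single '/'.
std::string normalize_prefix(std::string path)
{
    path += "/";
    for (std::string::size_type i = path.find("//");
         i != std::string::npos;
         i = path.find("//")) {
        path.erase(i, 1);
    }
    return path;
}

// Hypothetical helper (not part of crucible): drop prefix from the front
// of path, but only when path actually starts with it.
std::string strip_prefix(std::string path, const std::string &prefix)
{
    if (!prefix.empty() && path.find(prefix) == 0) {
        path.erase(0, prefix.length());
    }
    return path;
}
```

Because the prefix always ends in exactly one slash, a path such as /mnt/database/file is not shortened by a prefix of /mnt/data/, which is presumably why the slash is appended before comparing.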

599
lib/fs.cc
View File


@@ -2,10 +2,10 @@
#include "crucible/error.h"
#include "crucible/fd.h"
#include "crucible/hexdump.h"
#include "crucible/limits.h"
#include "crucible/ntoa.h"
#include "crucible/string.h"
#include "crucible/uuid.h"
// FS_IOC_FIEMAP
#include <linux/fs.h>
@@ -33,19 +33,11 @@ namespace crucible {
#endif
}
BtrfsExtentInfo::BtrfsExtentInfo(int dst_fd, off_t dst_offset)
{
memset_zero<btrfs_ioctl_same_extent_info>(this);
fd = dst_fd;
logical_offset = dst_offset;
}
BtrfsExtentSame::BtrfsExtentSame(int src_fd, off_t src_offset, off_t src_length) :
m_logical_offset(src_offset),
m_length(src_length),
m_fd(src_fd)
{
memset_zero<btrfs_ioctl_same_args>(this);
logical_offset = src_offset;
length = src_length;
}
BtrfsExtentSame::~BtrfsExtentSame()
@@ -53,9 +45,12 @@ namespace crucible {
}
void
BtrfsExtentSame::add(int fd, off_t offset)
BtrfsExtentSame::add(int const fd, uint64_t const offset)
{
m_info.push_back(BtrfsExtentInfo(fd, offset));
m_info.push_back( (btrfs_ioctl_same_extent_info) {
.fd = fd,
.logical_offset = offset,
});
}
ostream &
@@ -112,11 +107,8 @@ namespace crucible {
os << " '" << fd_name << "'";
});
}
os << ", .logical_offset = " << to_hex(bes.logical_offset);
os << ", .length = " << to_hex(bes.length);
os << ", .dest_count = " << bes.dest_count;
os << ", .reserved1 = " << bes.reserved1;
os << ", .reserved2 = " << bes.reserved2;
os << ", .logical_offset = " << to_hex(bes.m_logical_offset);
os << ", .length = " << to_hex(bes.m_length);
os << ", .info[] = {";
for (size_t i = 0; i < bes.m_info.size(); ++i) {
os << " [" << i << "] = " << &(bes.m_info[i]) << ",";
@@ -127,67 +119,25 @@ namespace crucible {
void
btrfs_clone_range(int src_fd, off_t src_offset, off_t src_length, int dst_fd, off_t dst_offset)
{
struct btrfs_ioctl_clone_range_args args;
memset_zero(&args);
args.src_fd = src_fd;
args.src_offset = src_offset;
args.src_length = src_length;
args.dest_offset = dst_offset;
btrfs_ioctl_clone_range_args args ( (btrfs_ioctl_clone_range_args) {
.src_fd = src_fd,
.src_offset = ranged_cast<uint64_t, off_t>(src_offset),
.src_length = ranged_cast<uint64_t, off_t>(src_length),
.dest_offset = ranged_cast<uint64_t, off_t>(dst_offset),
} );
DIE_IF_MINUS_ONE(ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, &args));
}
// Userspace emulation of extent-same ioctl to work around kernel bugs
// (a memory leak, a deadlock, inability to cope with unaligned EOF, and a length limit)
// The emulation is incomplete: no locking, and we always change ctime
void
BtrfsExtentSameByClone::do_ioctl()
{
if (length <= 0) {
throw out_of_range(string("length = 0 in ") + __PRETTY_FUNCTION__);
}
vector<char> cmp_buf_common(length);
vector<char> cmp_buf_iter(length);
pread_or_die(m_fd, cmp_buf_common.data(), length, logical_offset);
for (auto i = m_info.begin(); i != m_info.end(); ++i) {
i->status = -EIO;
i->bytes_deduped = 0;
// save atime/ctime for later
Stat target_stat(i->fd);
pread_or_die(i->fd, cmp_buf_iter.data(), length, i->logical_offset);
if (cmp_buf_common == cmp_buf_iter) {
// This never happens, so stop checking.
// assert(!memcmp(cmp_buf_common.data(), cmp_buf_iter.data(), length));
btrfs_clone_range(m_fd, logical_offset, length, i->fd, i->logical_offset);
i->status = 0;
i->bytes_deduped = length;
// The extent-same ioctl does not change mtime (as of patch v4)
struct timespec restore_ts[2] = {
target_stat.st_atim,
target_stat.st_mtim
};
// Ignore futimens failure as the real extent-same ioctl would never raise it
futimens(i->fd, restore_ts);
} else {
assert(memcmp(cmp_buf_common.data(), cmp_buf_iter.data(), length));
i->status = BTRFS_SAME_DATA_DIFFERS;
}
}
}
void
BtrfsExtentSame::do_ioctl()
{
dest_count = m_info.size();
vector<char> ioctl_arg = vector_copy_struct<btrfs_ioctl_same_args>(this);
ioctl_arg.resize(sizeof(btrfs_ioctl_same_args) + dest_count * sizeof(btrfs_ioctl_same_extent_info), 0);
btrfs_ioctl_same_args *ioctl_ptr = reinterpret_cast<btrfs_ioctl_same_args *>(ioctl_arg.data());
const size_t buf_size = sizeof(btrfs_ioctl_same_args) + m_info.size() * sizeof(btrfs_ioctl_same_extent_info);
ByteVector ioctl_arg( (btrfs_ioctl_same_args) {
.logical_offset = m_logical_offset,
.length = m_length,
.dest_count = ranged_cast<decltype(btrfs_ioctl_same_args::dest_count)>(m_info.size()),
}, buf_size);
btrfs_ioctl_same_args *const ioctl_ptr = ioctl_arg.get<btrfs_ioctl_same_args>();
size_t count = 0;
for (auto i = m_info.cbegin(); i != m_info.cend(); ++i) {
ioctl_ptr->info[count] = static_cast<const btrfs_ioctl_same_extent_info &>(m_info[count]);
@@ -209,12 +159,13 @@ namespace crucible {
{
THROW_CHECK1(invalid_argument, src_length, src_length > 0);
while (src_length > 0) {
off_t length = min(off_t(BTRFS_MAX_DEDUPE_LEN), src_length);
BtrfsExtentSame bes(src_fd, src_offset, length);
BtrfsExtentSame bes(src_fd, src_offset, src_length);
bes.add(dst_fd, dst_offset);
bes.do_ioctl();
auto status = bes.m_info.at(0).status;
const auto status = bes.m_info.at(0).status;
if (status == 0) {
const off_t length = bes.m_info.at(0).bytes_deduped;
THROW_CHECK0(invalid_argument, length > 0);
src_offset += length;
dst_offset += length;
src_length -= length;
@@ -233,23 +184,22 @@ namespace crucible {
}
BtrfsDataContainer::BtrfsDataContainer(size_t buf_size) :
m_data(buf_size, 0)
m_data(buf_size)
{
}
void *
BtrfsDataContainer::prepare()
BtrfsDataContainer::prepare(size_t container_size)
{
btrfs_data_container *p = reinterpret_cast<btrfs_data_container *>(m_data.data());
size_t min_size = offsetof(btrfs_data_container, val);
size_t container_size = m_data.size();
const size_t min_size = offsetof(btrfs_data_container, val);
if (container_size < min_size) {
THROW_ERROR(out_of_range, "container size " << container_size << " smaller than minimum " << min_size);
}
p->bytes_left = 0;
p->bytes_missing = 0;
p->elem_cnt = 0;
p->elem_missed = 0;
if (m_data.size() < container_size) {
m_data = ByteVector(container_size);
}
const auto p = m_data.get<btrfs_data_container>();
*p = (btrfs_data_container) { };
return p;
}
@@ -262,25 +212,29 @@ namespace crucible {
decltype(btrfs_data_container::bytes_left)
BtrfsDataContainer::get_bytes_left() const
{
return bytes_left;
const auto p = m_data.get<btrfs_data_container>();
return p->bytes_left;
}
decltype(btrfs_data_container::bytes_missing)
BtrfsDataContainer::get_bytes_missing() const
{
return bytes_missing;
const auto p = m_data.get<btrfs_data_container>();
return p->bytes_missing;
}
decltype(btrfs_data_container::elem_cnt)
BtrfsDataContainer::get_elem_cnt() const
{
return elem_cnt;
const auto p = m_data.get<btrfs_data_container>();
return p->elem_cnt;
}
decltype(btrfs_data_container::elem_missed)
BtrfsDataContainer::get_elem_missed() const
{
return elem_missed;
const auto p = m_data.get<btrfs_data_container>();
return p->elem_missed;
}
ostream &
@@ -290,7 +244,7 @@ namespace crucible {
return os << "BtrfsIoctlLogicalInoArgs NULL";
}
os << "BtrfsIoctlLogicalInoArgs {";
os << " .logical = " << to_hex(p->logical);
os << " .m_logical = " << to_hex(p->m_logical);
os << " .inodes[] = {\n";
unsigned count = 0;
for (auto i = p->m_iors.cbegin(); i != p->m_iors.cend(); ++i) {
@@ -301,33 +255,134 @@ namespace crucible {
}
BtrfsIoctlLogicalInoArgs::BtrfsIoctlLogicalInoArgs(uint64_t new_logical, size_t new_size) :
m_container(new_size)
m_container_size(new_size),
m_container(new_size),
m_logical(new_logical)
{
memset_zero<btrfs_ioctl_logical_ino_args>(this);
logical = new_logical;
}
size_t
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::size() const
{
return m_end - m_begin;
}
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::const_iterator
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::cbegin() const
{
return m_begin;
}
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::const_iterator
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::cend() const
{
return m_end;
}
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::iterator
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::begin() const
{
return m_begin;
}
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::iterator
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::end() const
{
return m_end;
}
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::iterator
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::data() const
{
return m_begin;
}
void
BtrfsIoctlLogicalInoArgs::BtrfsInodeOffsetRootSpan::clear()
{
m_end = m_begin = nullptr;
}
void
BtrfsIoctlLogicalInoArgs::set_flags(uint64_t new_flags)
{
m_flags = new_flags;
}
uint64_t
BtrfsIoctlLogicalInoArgs::get_flags() const
{
// We are still supporting building with old headers that don't have .flags yet
return m_flags;
}
void
BtrfsIoctlLogicalInoArgs::set_logical(uint64_t new_logical)
{
m_logical = new_logical;
}
void
BtrfsIoctlLogicalInoArgs::set_size(uint64_t new_size)
{
m_container_size = new_size;
}
bool
BtrfsIoctlLogicalInoArgs::do_ioctl_nothrow(int fd)
{
btrfs_ioctl_logical_ino_args *p = static_cast<btrfs_ioctl_logical_ino_args *>(this);
inodes = reinterpret_cast<uint64_t>(m_container.prepare());
size = m_container.get_size();
btrfs_ioctl_logical_ino_args args = (btrfs_ioctl_logical_ino_args) {
.logical = m_logical,
.size = m_container_size,
.inodes = reinterpret_cast<uintptr_t>(m_container.prepare(m_container_size)),
};
// We are still supporting building with old headers that don't have .flags yet
*(&args.reserved[0] + 3) = m_flags;
btrfs_ioctl_logical_ino_args *const p = &args;
m_iors.clear();
if (ioctl(fd, BTRFS_IOC_LOGICAL_INO, p)) {
return false;
static unsigned long bili_version = 0;
if (get_flags() == 0) {
// Could use either V1 or V2
if (bili_version) {
// We tested both versions and came to a decision
if (ioctl(fd, bili_version, p)) {
return false;
}
} else {
// Try V2
if (ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, p)) {
// V2 failed, try again with V1
if (ioctl(fd, BTRFS_IOC_LOGICAL_INO, p)) {
// both V1 and V2 failed, doesn't tell us which one to choose
return false;
}
// V1 and V2 both tested with same arguments, V1 OK, and V2 failed
bili_version = BTRFS_IOC_LOGICAL_INO;
} else {
// V2 succeeded, don't use V1 any more
bili_version = BTRFS_IOC_LOGICAL_INO_V2;
}
}
} else {
// Flags/size require a V2 feature, no fallback to V1 possible
if (ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, p)) {
return false;
}
// V2 succeeded so we don't need to probe any more
bili_version = BTRFS_IOC_LOGICAL_INO_V2;
}
btrfs_data_container *bdc = reinterpret_cast<btrfs_data_container *>(p->inodes);
BtrfsInodeOffsetRoot *input_iter = reinterpret_cast<BtrfsInodeOffsetRoot *>(bdc->val);
m_iors.reserve(bdc->elem_cnt);
for (auto count = bdc->elem_cnt; count > 2; count -= 3) {
m_iors.push_back(*input_iter++);
}
btrfs_data_container *const bdc = reinterpret_cast<btrfs_data_container *>(p->inodes);
BtrfsInodeOffsetRoot *const ior_iter = reinterpret_cast<BtrfsInodeOffsetRoot *>(bdc->val);
// elem_cnt counts uint64_t, but BtrfsInodeOffsetRoot is 3x uint64_t
THROW_CHECK1(runtime_error, bdc->elem_cnt, bdc->elem_cnt % 3 == 0);
m_iors.m_begin = ior_iter;
m_iors.m_end = ior_iter + bdc->elem_cnt / 3;
return true;
}
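
The new do_ioctl_nothrow code above relies on elem_cnt counting uint64_t values while each result record holds three of them. A minimal sketch of that arithmetic follows, assuming a hypothetical free function record_count; the diff itself performs the check with THROW_CHECK1 and then sets the span end pointer directly.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (not part of crucible): convert a count of uint64_t
// words into a count of 3-word (inode, offset, root) records.
// Throws if the word count is not a whole number of records.
std::size_t record_count(std::size_t elem_cnt)
{
    if (elem_cnt % 3 != 0) {
        throw std::runtime_error("elem_cnt is not a multiple of 3");
    }
    return elem_cnt / 3;
}
```

Rejecting a count that is not a multiple of three differs from the removed loop, which ignored any leftover words.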
@@ -350,9 +405,10 @@ namespace crucible {
}
BtrfsIoctlInoPathArgs::BtrfsIoctlInoPathArgs(uint64_t inode, size_t new_size) :
m_container(new_size)
btrfs_ioctl_ino_path_args( (btrfs_ioctl_ino_path_args) { } ),
m_container_size(new_size)
{
memset_zero<btrfs_ioctl_ino_path_args>(this);
assert(inum == 0);
inum = inode;
}
@@ -360,8 +416,9 @@ namespace crucible {
BtrfsIoctlInoPathArgs::do_ioctl_nothrow(int fd)
{
btrfs_ioctl_ino_path_args *p = static_cast<btrfs_ioctl_ino_path_args *>(this);
fspath = reinterpret_cast<uint64_t>(m_container.prepare());
size = m_container.get_size();
BtrfsDataContainer container(m_container_size);
fspath = reinterpret_cast<uintptr_t>(container.prepare(m_container_size));
size = container.get_size();
m_paths.clear();
@@ -369,16 +426,16 @@ namespace crucible {
return false;
}
btrfs_data_container *bdc = reinterpret_cast<btrfs_data_container *>(p->fspath);
btrfs_data_container *const bdc = reinterpret_cast<btrfs_data_container *>(p->fspath);
m_paths.reserve(bdc->elem_cnt);
const uint64_t *up = reinterpret_cast<const uint64_t *>(bdc->val);
const char *cp = reinterpret_cast<const char *>(bdc->val);
const char *const cp = reinterpret_cast<const char *>(bdc->val);
for (auto count = bdc->elem_cnt; count > 0; --count) {
const char *path = cp + *up++;
if (static_cast<size_t>(path - cp) > m_container.get_size()) {
THROW_ERROR(out_of_range, "offset " << (path - cp) << " > size " << m_container.get_size() << " in " << __PRETTY_FUNCTION__);
const char *const path = cp + *up++;
if (static_cast<size_t>(path - cp) > container.get_size()) {
THROW_ERROR(out_of_range, "offset " << (path - cp) << " > size " << container.get_size() << " in " << __PRETTY_FUNCTION__);
}
m_paths.push_back(string(path));
}
@@ -411,9 +468,10 @@ namespace crucible {
return os;
}
BtrfsIoctlInoLookupArgs::BtrfsIoctlInoLookupArgs(uint64_t new_objectid)
BtrfsIoctlInoLookupArgs::BtrfsIoctlInoLookupArgs(uint64_t new_objectid) :
btrfs_ioctl_ino_lookup_args( (btrfs_ioctl_ino_lookup_args) { } )
{
memset_zero<btrfs_ioctl_ino_lookup_args>(this);
assert(objectid == 0);
objectid = new_objectid;
}
@@ -431,9 +489,9 @@ namespace crucible {
}
}
BtrfsIoctlDefragRangeArgs::BtrfsIoctlDefragRangeArgs()
BtrfsIoctlDefragRangeArgs::BtrfsIoctlDefragRangeArgs() :
btrfs_ioctl_defrag_range_args( (btrfs_ioctl_defrag_range_args) { } )
{
memset_zero<btrfs_ioctl_defrag_range_args>(this);
}
bool
@@ -463,11 +521,13 @@ namespace crucible {
}
string
btrfs_ioctl_defrag_range_compress_type_ntoa(uint32_t compress_type)
btrfs_compress_type_ntoa(uint8_t compress_type)
{
static const bits_ntoa_table table[] = {
NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_NONE),
NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_ZLIB),
NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_LZO),
NTOA_TABLE_ENTRY_ENUM(BTRFS_COMPRESS_ZSTD),
NTOA_TABLE_ENTRY_END()
};
return bits_ntoa(compress_type, table);
@@ -484,14 +544,14 @@ namespace crucible {
os << " .len = " << p->len;
os << " .flags = " << btrfs_ioctl_defrag_range_flags_ntoa(p->flags);
os << " .extent_thresh = " << p->extent_thresh;
os << " .compress_type = " << btrfs_ioctl_defrag_range_compress_type_ntoa(p->compress_type);
os << " .compress_type = " << btrfs_compress_type_ntoa(p->compress_type);
os << " .unused[4] = { " << p->unused[0] << ", " << p->unused[1] << ", " << p->unused[2] << ", " << p->unused[3] << "} }";
return os;
}
FiemapExtent::FiemapExtent()
FiemapExtent::FiemapExtent() :
fiemap_extent( (fiemap_extent) { } )
{
memset_zero<fiemap_extent>(this);
}
FiemapExtent::FiemapExtent(const fiemap_extent &that)
@@ -598,13 +658,10 @@ namespace crucible {
operator<<(ostream &os, const Fiemap &args)
{
os << "Fiemap {";
os << " .fm_start = " << to_hex(args.fm_start) << ".." << to_hex(args.fm_start + args.fm_length);
os << ", .fm_length = " << to_hex(args.fm_length);
if (args.fm_flags) os << ", .fm_flags = " << fiemap_flags_ntoa(args.fm_flags);
os << ", .fm_mapped_extents = " << args.fm_mapped_extents;
os << ", .fm_extent_count = " << args.fm_extent_count;
if (args.fm_reserved) os << ", .fm_reserved = " << args.fm_reserved;
os << ", .fm_extents[] = {";
os << " .m_start = " << to_hex(args.m_start) << ".." << to_hex(args.m_start + args.m_length);
os << ", .m_length = " << to_hex(args.m_length);
os << ", .m_flags = " << fiemap_flags_ntoa(args.m_flags);
os << ", .fm_extents[" << args.m_extents.size() << "] = {";
size_t count = 0;
for (auto i = args.m_extents.cbegin(); i != args.m_extents.cend(); ++i) {
os << "\n\t[" << count++ << "] = " << &(*i) << ",";
@@ -612,41 +669,35 @@ namespace crucible {
return os << "\n}";
}
Fiemap::Fiemap(uint64_t start, uint64_t length)
Fiemap::Fiemap(uint64_t start, uint64_t length) :
m_start(start),
m_length(length)
{
memset_zero<fiemap>(this);
fm_start = start;
fm_length = length;
// FIEMAP is slow and full of lies.
// This makes FIEMAP even slower, but reduces the lies a little.
fm_flags = FIEMAP_FLAG_SYNC;
}
void
Fiemap::do_ioctl(int fd)
{
CHECK_CONSTRAINT(m_min_count, m_min_count <= m_max_count);
THROW_CHECK1(out_of_range, m_min_count, m_min_count <= m_max_count);
THROW_CHECK1(out_of_range, m_min_count, m_min_count > 0);
auto extent_count = m_min_count;
vector<char> ioctl_arg = vector_copy_struct<fiemap>(this);
const auto extent_count = m_min_count;
ByteVector ioctl_arg(sizeof(fiemap) + extent_count * sizeof(fiemap_extent));
ioctl_arg.resize(sizeof(fiemap) + extent_count * sizeof(fiemap_extent), 0);
fiemap *const ioctl_ptr = ioctl_arg.get<fiemap>();
fiemap *ioctl_ptr = reinterpret_cast<fiemap *>(ioctl_arg.data());
auto start = fm_start;
auto end = fm_start + fm_length;
auto orig_start = fm_start;
auto orig_length = fm_length;
auto start = m_start;
const auto end = m_start + m_length;
vector<FiemapExtent> extents;
while (start < end && extents.size() < m_max_count) {
ioctl_ptr->fm_start = start;
ioctl_ptr->fm_length = end - start;
ioctl_ptr->fm_extent_count = extent_count;
ioctl_ptr->fm_mapped_extents = 0;
*ioctl_ptr = (fiemap) {
.fm_start = start,
.fm_length = end - start,
.fm_flags = m_flags,
.fm_extent_count = extent_count,
};
// cerr << "Before (fd = " << fd << ") : " << ioctl_ptr << endl;
DIE_IF_MINUS_ONE(ioctl(fd, FS_IOC_FIEMAP, ioctl_ptr));
@@ -672,68 +723,107 @@ namespace crucible {
}
}
fiemap *this_ptr = static_cast<fiemap *>(this);
*this_ptr = *ioctl_ptr;
fm_start = orig_start;
fm_length = orig_length;
fm_extent_count = extents.size();
m_extents = extents;
}
BtrfsIoctlSearchKey::BtrfsIoctlSearchKey(size_t buf_size) :
btrfs_ioctl_search_key( (btrfs_ioctl_search_key) {
.max_objectid = numeric_limits<decltype(max_objectid)>::max(),
.max_offset = numeric_limits<decltype(max_offset)>::max(),
.max_transid = numeric_limits<decltype(max_transid)>::max(),
.max_type = numeric_limits<decltype(max_type)>::max(),
.nr_items = 1,
}),
m_buf_size(buf_size)
{
memset_zero<btrfs_ioctl_search_key>(this);
max_objectid = numeric_limits<decltype(max_objectid)>::max();
max_offset = numeric_limits<decltype(max_offset)>::max();
max_transid = numeric_limits<decltype(max_transid)>::max();
max_type = numeric_limits<decltype(max_type)>::max();
nr_items = numeric_limits<decltype(nr_items)>::max();
}
BtrfsIoctlSearchHeader::BtrfsIoctlSearchHeader()
BtrfsIoctlSearchHeader::BtrfsIoctlSearchHeader() :
btrfs_ioctl_search_header( (btrfs_ioctl_search_header) { } )
{
memset_zero<btrfs_ioctl_search_header>(this);
}
size_t
BtrfsIoctlSearchHeader::set_data(const vector<char> &v, size_t offset)
BtrfsIoctlSearchHeader::set_data(const ByteVector &v, size_t offset)
{
THROW_CHECK2(invalid_argument, offset, v.size(), offset + sizeof(btrfs_ioctl_search_header) <= v.size());
memcpy(this, &v[offset], sizeof(btrfs_ioctl_search_header));
memcpy(static_cast<btrfs_ioctl_search_header *>(this), &v[offset], sizeof(btrfs_ioctl_search_header));
offset += sizeof(btrfs_ioctl_search_header);
THROW_CHECK2(invalid_argument, offset + len, v.size(), offset + len <= v.size());
m_data = vector<char>(&v[offset], &v[offset + len]);
m_data = ByteVector(v, offset, len);
return offset + len;
}
thread_local size_t BtrfsIoctlSearchKey::s_calls = 0;
thread_local size_t BtrfsIoctlSearchKey::s_loops = 0;
thread_local size_t BtrfsIoctlSearchKey::s_loops_empty = 0;
thread_local shared_ptr<ostream> BtrfsIoctlSearchKey::s_debug_ostream;
bool
BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
{
vector<char> ioctl_arg = vector_copy_struct<btrfs_ioctl_search_key>(this);
ioctl_arg.resize(sizeof(btrfs_ioctl_search_args_v2) + m_buf_size, 0);
btrfs_ioctl_search_args_v2 *ioctl_ptr = reinterpret_cast<btrfs_ioctl_search_args_v2 *>(ioctl_arg.data());
ioctl_ptr->buf_size = m_buf_size;
// Don't bother supporting V1. Kernels that old have other problems.
int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_ptr);
if (rv != 0) {
return false;
}
static_cast<btrfs_ioctl_search_key&>(*this) = ioctl_ptr->key;
// It would be really nice if the kernel tells us whether our
// buffer overflowed or how big the overflowing object
// was; instead, we have to guess.
m_result.clear();
m_result.reserve(nr_items);
// Make sure there is space for at least the search key and one (empty) header
size_t buf_size = max(m_buf_size, sizeof(btrfs_ioctl_search_args_v2) + sizeof(btrfs_ioctl_search_header));
ByteVector ioctl_arg;
btrfs_ioctl_search_args_v2 *ioctl_ptr;
do {
// ioctl buffer size does not include search key header or buffer size
ioctl_arg = ByteVector(buf_size + sizeof(btrfs_ioctl_search_args_v2));
ioctl_ptr = ioctl_arg.get<btrfs_ioctl_search_args_v2>();
ioctl_ptr->key = static_cast<const btrfs_ioctl_search_key&>(*this);
ioctl_ptr->buf_size = buf_size;
if (s_debug_ostream) {
(*s_debug_ostream) << "bisk " << (ioctl_ptr->key) << "\n";
}
// Don't bother supporting V1. Kernels that old have other problems.
int rv = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, ioctl_arg.data());
++s_calls;
if (rv != 0 && errno == ENOENT) {
// If we are searching a tree that is deleted or no longer exists, just return an empty list
ioctl_ptr->key.nr_items = 0;
break;
}
if (rv != 0 && errno != EOVERFLOW) {
return false;
}
if (rv == 0 && nr_items <= ioctl_ptr->key.nr_items) {
// got all the items we wanted, thanks
m_buf_size = max(m_buf_size, buf_size);
break;
}
// Didn't get all the items we wanted. Increase the buf size and try again.
// These sizes are very common on default-formatted btrfs, so use these
// instead of naive doubling.
if (buf_size < 4096) {
buf_size = 4096;
} else if (buf_size < 16384) {
buf_size = 16384;
} else if (buf_size < 65536) {
buf_size = 65536;
} else {
buf_size *= 2;
}
// don't automatically raise the buf size higher than 64K, the largest possible btrfs item
++s_loops;
if (ioctl_ptr->key.nr_items == 0) {
++s_loops_empty;
}
} while (buf_size < 65536);
// ioctl changes nr_items, this has to be copied back
static_cast<btrfs_ioctl_search_key&>(*this) = ioctl_ptr->key;
size_t offset = pointer_distance(ioctl_ptr->buf, ioctl_ptr);
for (decltype(nr_items) i = 0; i < nr_items; ++i) {
BtrfsIoctlSearchHeader item;
offset = item.set_data(ioctl_arg, offset);
m_result.push_back(item);
m_result.insert(item);
}
return true;
}
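
The retry loop in the new BtrfsIoctlSearchKey::do_ioctl_nothrow above grows the search buffer along a fixed ladder of sizes rather than by plain doubling. The ladder alone can be sketched as a pure function; next_search_buf_size is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of crucible): the next buffer size to try
// after a search returned fewer items than requested.  Sizes below 64K
// snap up to 4K, 16K, or 64K; anything larger is doubled.
std::size_t next_search_buf_size(std::size_t buf_size)
{
    if (buf_size < 4096) {
        return 4096;
    }
    if (buf_size < 16384) {
        return 16384;
    }
    if (buf_size < 65536) {
        return 65536;
    }
    return buf_size * 2;
}
```

In the diff the loop condition is buf_size < 65536, so automatic growth stops once the ladder reaches 64K; the doubling branch only matters for a caller that starts with a larger buffer.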
@@ -741,7 +831,7 @@ namespace crucible {
BtrfsIoctlSearchKey::do_ioctl(int fd)
{
if (!do_ioctl_nothrow(fd)) {
THROW_ERRNO("BTRFS_IOC_TREE_SEARCH_V2: " << name_fd(fd));
THROW_ERRNO("BTRFS_IOC_TREE_SEARCH_V2: " << name_fd(fd) << ": " << *this);
}
}
@@ -752,31 +842,67 @@ namespace crucible {
min_type = ref.type;
min_offset = ref.offset + 1;
if (min_offset < ref.offset) {
// We wrapped, try the next objectid
++min_objectid;
// We wrapped, try the next type
++min_type;
assert(min_offset == 0);
if (min_type < ref.type) {
assert(min_type == 0);
// We wrapped, try the next objectid
++min_objectid;
// no advancement possible at end
THROW_CHECK1(runtime_error, min_type, min_type == 0);
}
}
}
ostream &hexdump(ostream &os, const vector<char> &v)
void
BtrfsIoctlSearchKey::next_min(const BtrfsIoctlSearchHeader &ref, const uint8_t type)
{
os << "vector<char> { size = " << v.size() << ", data:\n";
for (size_t i = 0; i < v.size(); i += 8) {
string hex, ascii;
for (size_t j = i; j < i + 8; ++j) {
if (j < v.size()) {
unsigned char c = v[j];
char buf[8];
sprintf(buf, "%02x ", c);
hex += buf;
ascii += (c < 32 || c > 126) ? '.' : c;
} else {
hex += " ";
ascii += ' ';
}
if (ref.type < type) {
// forward to type in same object with zero offset
min_objectid = ref.objectid;
min_type = type;
min_offset = 0;
} else if (ref.type > type) {
// skip directly to start of next objectid with target type
min_objectid = ref.objectid + 1;
// no advancement possible at end
THROW_CHECK2(out_of_range, min_objectid, ref.objectid, min_objectid > ref.objectid);
min_type = type;
min_offset = 0;
} else {
// advance within this type
min_objectid = ref.objectid;
min_type = ref.type;
min_offset = ref.offset + 1;
if (min_offset < ref.offset) {
// We wrapped, try the next objectid, same type
++min_objectid;
THROW_CHECK2(out_of_range, min_objectid, ref.objectid, min_objectid > ref.objectid);
min_type = type;
assert(min_offset == 0);
}
os << astringprintf("\t%08x %s %s\n", i, hex.c_str(), ascii.c_str());
}
return os << "}";
}
string
btrfs_chunk_type_ntoa(uint64_t type)
{
static const bits_ntoa_table table[] = {
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DATA),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_METADATA),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_SYSTEM),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_DUP),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID0),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID10),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C3),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID1C4),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID5),
NTOA_TABLE_ENTRY_BITS(BTRFS_BLOCK_GROUP_RAID6),
NTOA_TABLE_ENTRY_END()
};
return bits_ntoa(type, table);
}
string
@@ -806,15 +932,9 @@ namespace crucible {
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_BLOCK_REF_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_SHARED_DATA_REF_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_BLOCK_GROUP_ITEM_KEY),
#ifdef BTRFS_FREE_SPACE_INFO_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_INFO_KEY),
#endif
#ifdef BTRFS_FREE_SPACE_EXTENT_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_EXTENT_KEY),
#endif
#ifdef BTRFS_FREE_SPACE_BITMAP_KEY
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_BITMAP_KEY),
#endif
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_EXTENT_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_DEV_ITEM_KEY),
NTOA_TABLE_ENTRY_ENUM(BTRFS_CHUNK_ITEM_KEY),
@@ -834,7 +954,7 @@ namespace crucible {
}
string
btrfs_search_objectid_ntoa(unsigned objectid)
btrfs_search_objectid_ntoa(uint64_t objectid)
{
static const bits_ntoa_table table[] = {
NTOA_TABLE_ENTRY_ENUM(BTRFS_ROOT_TREE_OBJECTID),
@@ -846,9 +966,7 @@ namespace crucible {
NTOA_TABLE_ENTRY_ENUM(BTRFS_CSUM_TREE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_QUOTA_TREE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_UUID_TREE_OBJECTID),
#ifdef BTRFS_FREE_SPACE_TREE_OBJECTID
NTOA_TABLE_ENTRY_ENUM(BTRFS_FREE_SPACE_TREE_OBJECTID),
#endif
NTOA_TABLE_ENTRY_ENUM(BTRFS_BALANCE_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_ORPHAN_OBJECTID),
NTOA_TABLE_ENTRY_ENUM(BTRFS_TREE_LOG_OBJECTID),
@@ -906,7 +1024,7 @@ namespace crucible {
ostream &
operator<<(ostream &os, const BtrfsIoctlSearchHeader &hdr)
{
os << "BtrfsIoctlSearchHeader { "
os << "BtrfsIoctlSearchHeader { "
<< static_cast<const btrfs_ioctl_search_header &>(hdr)
<< ", data = ";
hexdump(os, hdr.m_data);
@@ -916,7 +1034,7 @@ namespace crucible {
ostream &
operator<<(ostream &os, const BtrfsIoctlSearchKey &key)
{
os << "BtrfsIoctlSearchKey { "
os << "BtrfsIoctlSearchKey { "
<< static_cast<const btrfs_ioctl_search_key &>(key)
<< ", buf_size = " << key.m_buf_size
<< ", buf[" << key.m_result.size() << "] = {";
@@ -961,7 +1079,7 @@ namespace crucible {
}
if (i.objectid == root_id && i.type == BTRFS_ROOT_ITEM_KEY) {
rv = max(rv, uint64_t(call_btrfs_get(btrfs_root_generation, i.m_data)));
rv = max(rv, uint64_t(btrfs_get_member(&btrfs_root_item::generation, i.m_data)));
}
}
if (sk.min_offset < numeric_limits<decltype(sk.min_offset)>::max()) {
@@ -973,9 +1091,9 @@ namespace crucible {
return rv;
}
Statvfs::Statvfs()
Statvfs::Statvfs() :
statvfs( (statvfs) { } )
{
memset_zero<statvfs>(this);
}
Statvfs::Statvfs(int fd) :
@@ -1014,7 +1132,6 @@ namespace crucible {
os << "BtrfsIoctlFsInfoArgs {"
<< " max_id = " << a.max_id << ","
<< " num_devices = " << a.num_devices << ","
<< " fsid = " << a.uuid() << ","
#if 0
<< " nodesize = " << a.nodesize << ","
<< " sectorsize = " << a.sectorsize << ","
@@ -1027,24 +1144,54 @@ namespace crucible {
return os << " }";
};
BtrfsIoctlFsInfoArgs::BtrfsIoctlFsInfoArgs()
BtrfsIoctlFsInfoArgs::BtrfsIoctlFsInfoArgs() :
btrfs_ioctl_fs_info_args_v3( (btrfs_ioctl_fs_info_args_v3) {
.flags = 0
| BTRFS_FS_INFO_FLAG_CSUM_INFO
| BTRFS_FS_INFO_FLAG_GENERATION
,
})
{
memset_zero<btrfs_ioctl_fs_info_args>(this);
}
bool
BtrfsIoctlFsInfoArgs::do_ioctl_nothrow(int const fd)
{
btrfs_ioctl_fs_info_args_v3 *p = static_cast<btrfs_ioctl_fs_info_args_v3 *>(this);
return 0 == ioctl(fd, BTRFS_IOC_FS_INFO, p);
}
void
BtrfsIoctlFsInfoArgs::do_ioctl(int fd)
BtrfsIoctlFsInfoArgs::do_ioctl(int const fd)
{
btrfs_ioctl_fs_info_args *p = static_cast<btrfs_ioctl_fs_info_args *>(this);
if (ioctl(fd, BTRFS_IOC_FS_INFO, p)) {
if (!do_ioctl_nothrow(fd)) {
THROW_ERRNO("BTRFS_IOC_FS_INFO: fd " << fd);
}
}
string
BtrfsIoctlFsInfoArgs::uuid() const
uint16_t
BtrfsIoctlFsInfoArgs::csum_type() const
{
return uuid_unparse(fsid);
return this->btrfs_ioctl_fs_info_args_v3::csum_type;
}
uint16_t
BtrfsIoctlFsInfoArgs::csum_size() const
{
return this->btrfs_ioctl_fs_info_args_v3::csum_size;
}
vector<uint8_t>
BtrfsIoctlFsInfoArgs::fsid() const
{
const auto begin = btrfs_ioctl_fs_info_args_v3::fsid;
return vector<uint8_t>(begin, begin + BTRFS_FSID_SIZE);
}
uint64_t
BtrfsIoctlFsInfoArgs::generation() const
{
return this->btrfs_ioctl_fs_info_args_v3::generation;
}
};
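
The three-way branch in `BtrfsIoctlSearchKey::next_min(ref, type)` earlier in this file can be read in isolation as follows. This is only a sketch under stated assumptions: `Key` and `next_min_for_type` are hypothetical names, and `std::out_of_range` stands in for the `THROW_CHECK2` macro used in the diff.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the (objectid, type, offset) part of a search key.
struct Key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

// Return the smallest key after `ref` that can still hold an item of `type`.
Key next_min_for_type(const Key &ref, uint8_t type)
{
    Key next{ref.objectid, type, 0};
    if (ref.type < type) {
        // Target type is still ahead within the same object.
        return next;
    }
    if (ref.type == type) {
        // Advance within this type; unsigned overflow wraps to 0.
        next.offset = ref.offset + 1;
        if (next.offset > ref.offset) {
            return next;
        }
        // Offset wrapped: fall through to the next objectid.
    }
    // ref.type > type, or the offset wrapped: start of the next object.
    if (ref.objectid == UINT64_MAX) {
        throw std::out_of_range("no key after the last objectid");
    }
    next.objectid = ref.objectid + 1;
    next.offset = 0;
    return next;
}
```

Skipping straight to the next objectid when `ref.type > type` is what lets a caller that only wants one item type avoid walking every other item of the object.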


@@ -1,96 +0,0 @@
#include "crucible/interp.h"
#include "crucible/chatter.h"
namespace crucible {
using namespace std;
int
Proc::exec(const ArgList &args)
{
return m_cmd(args);
}
Proc::Proc(const function<int(const ArgList &)> &f) :
m_cmd(f)
{
}
Command::~Command()
{
}
ArgList::ArgList(const char **argv)
{
while (argv && *argv) {
push_back(*argv++);
}
}
ArgList::ArgList(const vector<string> &&that) :
vector<string>(that)
{
}
Interp::~Interp()
{
}
Interp::Interp(const map<string, shared_ptr<Command> > &cmdlist) :
m_commands(cmdlist)
{
}
void
Interp::add_command(const string &name, const shared_ptr<Command> &command)
{
m_commands[name] = command;
}
int
Interp::exec(const ArgList &args)
{
auto next_arg = args.begin();
++next_arg;
return m_commands.at(args[0])->exec(vector<string>(next_arg, args.end()));
}
ArgParser::~ArgParser()
{
}
ArgParser::ArgParser()
{
}
void
ArgParser::add_opt(string opt, ArgActor actor)
{
m_string_opts[opt] = actor;
}
void
ArgParser::parse_backend(void *t, const ArgList &args)
{
bool quote_args = false;
for (string arg : args) {
if (quote_args) {
cerr << "arg: '" << arg << "'" << endl;
continue;
}
if (arg == "--") {
quote_args = true;
continue;
}
if (arg.compare(0, 2, "--") == 0) {
auto found = m_string_opts.find(arg.substr(2, string::npos));
if (found != m_string_opts.end()) {
found->second.predicate(t, "foo");
}
(void)t;
}
}
}
};

83
lib/multilock.cc Normal file

@@ -0,0 +1,83 @@
#include "crucible/multilock.h"
#include "crucible/error.h"
namespace crucible {
using namespace std;
MultiLocker::LockHandle::LockHandle(const string &type, MultiLocker &parent) :
m_type(type),
m_parent(parent)
{
}
void
MultiLocker::LockHandle::set_locked(const bool state)
{
m_locked = state;
}
MultiLocker::LockHandle::~LockHandle()
{
if (m_locked) {
m_parent.put_lock(m_type);
m_locked = false;
}
}
bool
MultiLocker::is_lock_available(const string &type)
{
for (const auto &i : m_counters) {
if (i.second != 0 && i.first != type) {
return false;
}
}
return true;
}
void
MultiLocker::put_lock(const string &type)
{
unique_lock<mutex> lock(m_mutex);
auto &counter = m_counters[type];
THROW_CHECK2(runtime_error, type, counter, counter > 0);
--counter;
if (counter == 0) {
m_cv.notify_all();
}
}
shared_ptr<MultiLocker::LockHandle>
MultiLocker::get_lock_private(const string &type)
{
unique_lock<mutex> lock(m_mutex);
m_counters.insert(make_pair(type, size_t(0)));
while (!is_lock_available(type)) {
m_cv.wait(lock);
}
const auto rv = make_shared<LockHandle>(type, *this);
++m_counters[type];
rv->set_locked(true);
return rv;
}
static MultiLocker s_process_instance;
shared_ptr<MultiLocker::LockHandle>
MultiLocker::get_lock(const string &type)
{
if (s_process_instance.m_do_locking) {
return s_process_instance.get_lock_private(type);
} else {
return shared_ptr<MultiLocker::LockHandle>();
}
}
void
MultiLocker::enable_locking(const bool enabled)
{
s_process_instance.m_do_locking = enabled;
}
}
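
The rule enforced by `is_lock_available` is that any number of holders may share one lock type, but two different types are never held at once. A minimal single-threaded sketch of just that counting rule follows. The class name `TypeCounters` and its methods are hypothetical, and the mutex, condition variable, and `LockHandle` wrapper of the real class are deliberately left out.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

class TypeCounters {
    std::map<std::string, std::size_t> m_counters;
public:
    // A type is available when no *other* type currently has holders.
    bool available(const std::string &type) const
    {
        for (const auto &i : m_counters) {
            if (i.second != 0 && i.first != type) {
                return false;
            }
        }
        return true;
    }

    // Non-blocking variant of get_lock: returns false instead of waiting.
    bool try_get(const std::string &type)
    {
        if (!available(type)) {
            return false;
        }
        ++m_counters[type];
        return true;
    }

    // Release one holder of `type`.
    void put(const std::string &type)
    {
        auto &counter = m_counters[type];
        if (counter == 0) {
            throw std::runtime_error("put without matching get: " + type);
        }
        --counter;
    }
};
```

In the code above, the blocking behaviour comes from waiting on the condition variable until this same availability test succeeds.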


@@ -1,18 +1,17 @@
#include "crucible/ntoa.h"
#include <cassert>
#include <sstream>
#include <string>
#include "crucible/error.h"
#include "crucible/string.h"
namespace crucible {
using namespace std;
string bits_ntoa(unsigned long n, const bits_ntoa_table *table)
string bits_ntoa(unsigned long long n, const bits_ntoa_table *table)
{
string out;
while (n && table->a) {
// No bits in n outside of mask
assert( ((~table->mask) & table->n) == 0);
THROW_CHECK2(invalid_argument, table->mask, table->n, ((~table->mask) & table->n) == 0);
if ( (n & table->mask) == table->n) {
if (!out.empty()) {
out += "|";
@@ -23,12 +22,10 @@ namespace crucible {
++table;
}
if (n) {
ostringstream oss;
oss << "0x" << hex << n;
if (!out.empty()) {
out += "|";
}
out += oss.str();
out += to_hex(n);
}
if (out.empty()) {
out = "0";
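
For readers unfamiliar with the table-driven decoding in `bits_ntoa`, here is a self-contained sketch of the same idea. `FlagName` and `flags_to_string` are hypothetical names. The step that clears matched bits is not visible in this hunk and is an assumption about the elided lines.

```cpp
#include <sstream>
#include <string>

// Hypothetical table entry: `name` is printed when (value & mask) == bits.
struct FlagName {
    unsigned long long bits;
    unsigned long long mask;
    const char *name; // nullptr terminates the table
};

std::string flags_to_string(unsigned long long n, const FlagName *table)
{
    std::string out;
    for (; n && table->name; ++table) {
        if ((n & table->mask) == table->bits) {
            if (!out.empty()) out += "|";
            out += table->name;
            n &= ~table->mask; // assumed: matched bits are cleared
        }
    }
    if (n) {
        // Leftover bits that no table entry matched are shown in hex.
        std::ostringstream oss;
        oss << "0x" << std::hex << n;
        if (!out.empty()) out += "|";
        out += oss.str();
    }
    return out.empty() ? "0" : out;
}
```

Widening the argument to `unsigned long long`, as the diff does, matters on platforms where `unsigned long` is 32 bits, since btrfs flag words are 64 bits wide.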

40
lib/openat2.cc Normal file

@@ -0,0 +1,40 @@
#include "crucible/openat2.h"
#include <sys/syscall.h>
// Compatibility for building on old libc for new kernel
#if LINUX_VERSION_CODE < KERNEL_VERSION(5, 6, 0)
// Every arch that defines this uses 437, except Alpha, where 437 is
// mq_getsetattr.
#ifndef SYS_openat2
#ifdef __alpha__
#define SYS_openat2 547
#else
#define SYS_openat2 437
#endif
#endif
#endif // Linux version < v5.6
#include <fcntl.h>
#include <unistd.h>
extern "C" {
int
__attribute__((weak))
openat2(int const dirfd, const char *const pathname, struct open_how *const how, size_t const size)
throw()
{
#ifdef SYS_openat2
return syscall(SYS_openat2, dirfd, pathname, how, size);
#else
errno = ENOSYS;
return -1;
#endif
}
};


@@ -2,16 +2,23 @@
#include "crucible/chatter.h"
#include "crucible/error.h"
#include "crucible/ntoa.h"
#include <cstdlib>
#include <utility>
// for gettid()
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>
#include <sys/syscall.h>
extern "C" {
pid_t
__attribute__((weak))
gettid() throw()
{
return syscall(SYS_gettid);
}
};
namespace crucible {
using namespace std;
@@ -109,13 +116,102 @@ namespace crucible {
}
}
template<>
struct ResourceHandle<Process::id, Process>;
pid_t
gettid()
double
getloadavg1()
{
return syscall(SYS_gettid);
double loadavg[1];
const int rv = ::getloadavg(loadavg, 1);
if (rv != 1) {
THROW_ERRNO("getloadavg(..., 1)");
}
return loadavg[0];
}
double
getloadavg5()
{
double loadavg[2];
const int rv = ::getloadavg(loadavg, 2);
if (rv != 2) {
THROW_ERRNO("getloadavg(..., 2)");
}
return loadavg[1];
}
double
getloadavg15()
{
double loadavg[3];
const int rv = ::getloadavg(loadavg, 3);
if (rv != 3) {
THROW_ERRNO("getloadavg(..., 3)");
}
return loadavg[2];
}
static const struct bits_ntoa_table signals_table[] = {
// POSIX.1-1990
NTOA_TABLE_ENTRY_ENUM(SIGHUP),
NTOA_TABLE_ENTRY_ENUM(SIGINT),
NTOA_TABLE_ENTRY_ENUM(SIGQUIT),
NTOA_TABLE_ENTRY_ENUM(SIGILL),
NTOA_TABLE_ENTRY_ENUM(SIGABRT),
NTOA_TABLE_ENTRY_ENUM(SIGFPE),
NTOA_TABLE_ENTRY_ENUM(SIGKILL),
NTOA_TABLE_ENTRY_ENUM(SIGSEGV),
NTOA_TABLE_ENTRY_ENUM(SIGPIPE),
NTOA_TABLE_ENTRY_ENUM(SIGALRM),
NTOA_TABLE_ENTRY_ENUM(SIGTERM),
NTOA_TABLE_ENTRY_ENUM(SIGUSR1),
NTOA_TABLE_ENTRY_ENUM(SIGUSR2),
NTOA_TABLE_ENTRY_ENUM(SIGCHLD),
NTOA_TABLE_ENTRY_ENUM(SIGCONT),
NTOA_TABLE_ENTRY_ENUM(SIGSTOP),
NTOA_TABLE_ENTRY_ENUM(SIGTSTP),
NTOA_TABLE_ENTRY_ENUM(SIGTTIN),
NTOA_TABLE_ENTRY_ENUM(SIGTTOU),
// SUSv2 and POSIX.1-2001
NTOA_TABLE_ENTRY_ENUM(SIGBUS),
NTOA_TABLE_ENTRY_ENUM(SIGPOLL),
NTOA_TABLE_ENTRY_ENUM(SIGPROF),
NTOA_TABLE_ENTRY_ENUM(SIGSYS),
NTOA_TABLE_ENTRY_ENUM(SIGTRAP),
NTOA_TABLE_ENTRY_ENUM(SIGURG),
NTOA_TABLE_ENTRY_ENUM(SIGVTALRM),
NTOA_TABLE_ENTRY_ENUM(SIGXCPU),
NTOA_TABLE_ENTRY_ENUM(SIGXFSZ),
// Other
NTOA_TABLE_ENTRY_ENUM(SIGIOT),
#ifdef SIGEMT
NTOA_TABLE_ENTRY_ENUM(SIGEMT),
#endif
NTOA_TABLE_ENTRY_ENUM(SIGSTKFLT),
NTOA_TABLE_ENTRY_ENUM(SIGIO),
#ifdef SIGCLD
NTOA_TABLE_ENTRY_ENUM(SIGCLD),
#endif
NTOA_TABLE_ENTRY_ENUM(SIGPWR),
#ifdef SIGINFO
NTOA_TABLE_ENTRY_ENUM(SIGINFO),
#endif
#ifdef SIGLOST
NTOA_TABLE_ENTRY_ENUM(SIGLOST),
#endif
NTOA_TABLE_ENTRY_ENUM(SIGWINCH),
#ifdef SIGUNUSED
NTOA_TABLE_ENTRY_ENUM(SIGUNUSED),
#endif
NTOA_TABLE_ENTRY_END(),
};
string
signal_ntoa(int sig)
{
return bits_ntoa(sig, signals_table);
}
}

7
lib/seeker.cc Normal file

@@ -0,0 +1,7 @@
#include "crucible/seeker.h"
namespace crucible {
thread_local shared_ptr<ostream> tl_seeker_debug_str;
};


@@ -16,7 +16,7 @@ namespace crucible {
uint64_t
from_hex(const string &s)
{
return stoull(s, 0, 0);
return stoull(s, nullptr, 0);
}
vector<string>

254
lib/table.cc Normal file

@@ -0,0 +1,254 @@
#include "crucible/table.h"
#include "crucible/string.h"
namespace crucible {
namespace Table {
using namespace std;
Content
Fill(const char c)
{
return [=](size_t width, size_t height) -> string {
string rv;
while (height--) {
rv += string(width, c);
if (height) {
rv += "\n";
}
}
return rv;
};
}
Content
Text(const string &s)
{
return [=](size_t width, size_t height) -> string {
const auto lines = split("\n", s);
string rv;
size_t line_count = 0;
for (const auto &i : lines) {
if (line_count++) {
rv += "\n";
}
if (i.length() < width) {
rv += string(width - i.length(), ' ');
}
rv += i;
}
while (line_count < height) {
if (line_count++) {
rv += "\n";
}
rv += string(width, ' ');
}
return rv;
};
}
Content
Number(const string &s)
{
return [=](size_t width, size_t height) -> string {
const auto lines = split("\n", s);
string rv;
size_t line_count = 0;
for (const auto &i : lines) {
if (line_count++) {
rv += "\n";
}
if (i.length() < width) {
rv += string(width - i.length(), ' ');
}
rv += i;
}
while (line_count < height) {
if (line_count++) {
rv += "\n";
}
rv += string(width, ' ');
}
return rv;
};
}
Cell::Cell(const Content &fn) :
m_content(fn)
{
}
Cell&
Cell::operator=(const Content &fn)
{
m_content = fn;
return *this;
}
string
Cell::text(size_t width, size_t height) const
{
return m_content(width, height);
}
size_t
Dimension::size() const
{
return m_elements.size();
}
size_t
Dimension::insert(size_t pos)
{
++m_next_pos;
const auto insert_pos = min(m_elements.size(), pos);
const auto it = m_elements.begin() + insert_pos;
m_elements.insert(it, m_next_pos);
return insert_pos;
}
void
Dimension::erase(size_t pos)
{
const auto it = m_elements.begin() + min(m_elements.size(), pos);
m_elements.erase(it);
}
size_t
Dimension::at(size_t pos) const
{
return m_elements.at(pos);
}
Dimension&
Table::rows()
{
return m_rows;
};
const Dimension&
Table::rows() const
{
return m_rows;
};
Dimension&
Table::cols()
{
return m_cols;
};
const Dimension&
Table::cols() const
{
return m_cols;
};
const Cell&
Table::at(size_t row, size_t col) const
{
const auto row_idx = m_rows.at(row);
const auto col_idx = m_cols.at(col);
const auto found = m_cells.find(make_pair(row_idx, col_idx));
if (found == m_cells.end()) {
static const Cell s_empty(Fill('.'));
return s_empty;
}
return found->second;
};
Cell&
Table::at(size_t row, size_t col)
{
const auto row_idx = m_rows.at(row);
const auto col_idx = m_cols.at(col);
return m_cells[make_pair(row_idx, col_idx)];
};
static
pair<size_t, size_t>
text_size(const string &s)
{
const auto s_split = split("\n", s);
size_t width = 0;
for (const auto &i : s_split) {
width = max(width, i.length());
}
return make_pair(width, s_split.size());
}
ostream& operator<<(ostream &os, const Table &table)
{
const auto rows = table.rows().size();
const auto cols = table.cols().size();
vector<size_t> row_heights(rows, 1);
vector<size_t> col_widths(cols, 1);
// Get the size of all fixed- and minimum-sized content cells
for (size_t row = 0; row < table.rows().size(); ++row) {
vector<string> col_text;
for (size_t col = 0; col < table.cols().size(); ++col) {
col_text.push_back(table.at(row, col).text(0, 0));
const auto tsize = text_size(*col_text.rbegin());
row_heights[row] = max(row_heights[row], tsize.second);
col_widths[col] = max(col_widths[col], tsize.first);
}
}
// Render the table
for (size_t row = 0; row < table.rows().size(); ++row) {
vector<string> lines(row_heights[row], "");
for (size_t col = 0; col < table.cols().size(); ++col) {
const auto& table_cell = table.at(row, col);
const auto table_text = table_cell.text(col_widths[col], row_heights[row]);
auto col_lines = split("\n", table_text);
col_lines.resize(row_heights[row], "");
for (size_t line = 0; line < row_heights[row]; ++line) {
if (col > 0) {
lines[line] += table.mid();
}
lines[line] += col_lines[line];
}
}
for (const auto &line : lines) {
os << table.left() << line << table.right() << "\n";
}
}
return os;
}
void
Table::left(const string &s)
{
m_left = s;
}
void
Table::mid(const string &s)
{
m_mid = s;
}
void
Table::right(const string &s)
{
m_right = s;
}
const string&
Table::left() const
{
return m_left;
}
const string&
Table::mid() const
{
return m_mid;
}
const string&
Table::right() const
{
return m_right;
}
}
}
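
The `operator<<` above renders in two passes: it first measures every cell to find column widths and row heights, then asks each cell to render itself at the final size. A reduced sketch of that two-pass layout, restricted to single-line, right-justified cells, might look like this. `render_grid` is a hypothetical helper, not part of `crucible::Table`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Render a grid of single-line cells, right-justified, columns joined by `mid`.
std::string render_grid(const std::vector<std::vector<std::string>> &rows,
                        const std::string &mid)
{
    // Pass 1: each column is as wide as its widest cell.
    std::vector<std::size_t> widths;
    for (const auto &row : rows) {
        if (row.size() > widths.size()) {
            widths.resize(row.size(), 0);
        }
        for (std::size_t col = 0; col < row.size(); ++col) {
            widths[col] = std::max(widths[col], row[col].length());
        }
    }
    // Pass 2: pad every cell to its column width and emit one line per row.
    std::string out;
    for (const auto &row : rows) {
        for (std::size_t col = 0; col < widths.size(); ++col) {
            const std::string cell = col < row.size() ? row[col] : std::string();
            if (col > 0) {
                out += mid;
            }
            out += std::string(widths[col] - cell.length(), ' ') + cell;
        }
        out += "\n";
    }
    return out;
}
```

Measuring before rendering is what allows content such as `Fill` to stretch to whatever size the other cells in its row and column require.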

1034
lib/task.cc Normal file

File diff suppressed because it is too large


@@ -1,11 +1,13 @@
#include "crucible/time.h"
#include "crucible/error.h"
#include "crucible/process.h"
#include <algorithm>
#include <thread>
#include <cmath>
#include <ctime>
#include <thread>
namespace crucible {
@@ -59,16 +61,10 @@ namespace crucible {
m_start = chrono::high_resolution_clock::now();
}
void
Timer::set(const chrono::high_resolution_clock::time_point &start)
chrono::high_resolution_clock::time_point
Timer::get() const
{
m_start = start;
}
void
Timer::set(double delta)
{
m_start += chrono::duration_cast<chrono::high_resolution_clock::duration>(chrono::duration<double>(delta));
return m_start;
}
double
@@ -102,12 +98,16 @@ namespace crucible {
m_rate(rate),
m_burst(burst)
{
THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
}
RateLimiter::RateLimiter(double rate) :
m_rate(rate),
m_burst(rate)
{
THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
THROW_CHECK1(invalid_argument, m_burst, m_burst >= 0);
}
void
@@ -120,23 +120,27 @@ namespace crucible {
}
}
double
RateLimiter::sleep_time(double cost)
{
THROW_CHECK1(invalid_argument, m_rate, m_rate > 0);
borrow(cost);
unique_lock<mutex> lock(m_mutex);
update_tokens();
if (m_tokens >= 0) {
return 0;
}
return -m_tokens / m_rate;
}
void
RateLimiter::sleep_for(double cost)
{
borrow(cost);
while (1) {
unique_lock<mutex> lock(m_mutex);
update_tokens();
if (m_tokens >= 0) {
return;
}
double sleep_time(-m_tokens / m_rate);
lock.unlock();
if (sleep_time > 0.0) {
nanosleep(sleep_time);
} else {
return;
}
double time_to_sleep = sleep_time(cost);
if (time_to_sleep > 0.0) {
nanosleep(time_to_sleep);
} else {
return;
}
}
@@ -155,4 +159,211 @@ namespace crucible {
m_tokens -= cost;
}
void
RateLimiter::rate(double const new_rate)
{
THROW_CHECK1(invalid_argument, new_rate, new_rate > 0);
unique_lock<mutex> lock(m_mutex);
m_rate = new_rate;
}
double
RateLimiter::rate() const
{
unique_lock<mutex> lock(m_mutex);
return m_rate;
}
RateEstimator::RateEstimator(double min_delay, double max_delay) :
m_min_delay(min_delay),
m_max_delay(max_delay)
{
THROW_CHECK1(invalid_argument, min_delay, min_delay > 0);
THROW_CHECK1(invalid_argument, max_delay, max_delay > 0);
THROW_CHECK2(invalid_argument, min_delay, max_delay, max_delay > min_delay);
}
void
RateEstimator::update_unlocked(uint64_t new_count)
{
// Gradually reduce the effect of previous updates
if (m_last_decay.age() > 1) {
m_num *= m_decay;
m_den *= m_decay;
m_last_decay.reset();
}
// Add units over time to running totals
auto increment = new_count - min(new_count, m_last_count);
auto delta = max(0.0, m_last_update.lap());
m_num += increment;
m_den += delta;
m_last_count = new_count;
// If count increased, wake up any waiters
if (delta > 0) {
m_condvar.notify_all();
}
}
void
RateEstimator::update(uint64_t new_count)
{
unique_lock<mutex> lock(m_mutex);
return update_unlocked(new_count);
}
void
RateEstimator::update_monotonic(uint64_t new_count)
{
unique_lock<mutex> lock(m_mutex);
if (m_last_count == numeric_limits<uint64_t>::max() || new_count > m_last_count) {
return update_unlocked(new_count);
} else {
return update_unlocked(m_last_count);
}
}
void
RateEstimator::increment(const uint64_t more)
{
unique_lock<mutex> lock(m_mutex);
return update_unlocked(m_last_count + more);
}
uint64_t
RateEstimator::count() const
{
unique_lock<mutex> lock(m_mutex);
return m_last_count;
}
pair<double, double>
RateEstimator::ratio_unlocked() const
{
auto num = max(m_num, 1.0);
// auto den = max(m_den, 1.0);
// Rate estimation slows down if there are no new units to count
auto den = max(m_den + m_last_update.age(), 1.0);
auto sec_per_count = den / num;
if (sec_per_count < m_min_delay) {
return make_pair(1.0, m_min_delay);
}
if (sec_per_count > m_max_delay) {
return make_pair(1.0, m_max_delay);
}
return make_pair(num, den);
}
pair<double, double>
RateEstimator::ratio() const
{
unique_lock<mutex> lock(m_mutex);
return ratio_unlocked();
}
pair<double, double>
RateEstimator::raw() const
{
unique_lock<mutex> lock(m_mutex);
return make_pair(m_num, m_den);
}
double
RateEstimator::rate_unlocked() const
{
auto r = ratio_unlocked();
return r.first / r.second;
}
double
RateEstimator::rate() const
{
unique_lock<mutex> lock(m_mutex);
return rate_unlocked();
}
ostream &
operator<<(ostream &os, const RateEstimator &re)
{
os << "RateEstimator { ";
auto ratio = re.ratio();
auto raw = re.raw();
os << "count = " << re.count() << ", raw = " << raw.first << " / " << raw.second << ", ratio = " << ratio.first << " / " << ratio.second << ", rate = " << re.rate() << ", duration(1) = " << re.duration(1).count() << ", seconds_for(1) = " << re.seconds_for(1) << " }";
return os;
}
chrono::duration<double>
RateEstimator::duration_unlocked(uint64_t relative_count) const
{
auto dur = relative_count / rate_unlocked();
dur = min(m_max_delay, dur);
dur = max(m_min_delay, dur);
return chrono::duration<double>(dur);
}
chrono::duration<double>
RateEstimator::duration(uint64_t relative_count) const
{
unique_lock<mutex> lock(m_mutex);
return duration_unlocked(relative_count);
}
chrono::high_resolution_clock::time_point
RateEstimator::time_point_unlocked(uint64_t absolute_count) const
{
auto relative_count = absolute_count - min(m_last_count, absolute_count);
auto relative_duration = duration_unlocked(relative_count);
return m_last_update.get() + chrono::duration_cast<chrono::high_resolution_clock::duration>(relative_duration);
// return chrono::high_resolution_clock::now() + chrono::duration_cast<chrono::high_resolution_clock::duration>(relative_duration);
}
chrono::high_resolution_clock::time_point
RateEstimator::time_point(uint64_t absolute_count) const
{
unique_lock<mutex> lock(m_mutex);
return time_point_unlocked(absolute_count);
}
void
RateEstimator::wait_until(uint64_t new_count_absolute) const
{
unique_lock<mutex> lock(m_mutex);
auto saved_count = m_last_count;
while (saved_count <= m_last_count && m_last_count < new_count_absolute) {
// Stop waiting if clock runs backwards
saved_count = m_last_count;
m_condvar.wait(lock);
}
}
void
RateEstimator::wait_for(uint64_t new_count_relative) const
{
unique_lock<mutex> lock(m_mutex);
auto saved_count = m_last_count;
auto new_count_absolute = m_last_count + new_count_relative;
while (saved_count <= m_last_count && m_last_count < new_count_absolute) {
// Stop waiting if clock runs backwards
saved_count = m_last_count;
m_condvar.wait(lock);
}
}
double
RateEstimator::seconds_for(uint64_t new_count_relative) const
{
unique_lock<mutex> lock(m_mutex);
auto ts = time_point_unlocked(new_count_relative + m_last_count);
auto delta_dur = ts - chrono::high_resolution_clock::now();
return max(min(chrono::duration<double>(delta_dur).count(), m_max_delay), m_min_delay);
}
double
RateEstimator::seconds_until(uint64_t new_count_absolute) const
{
unique_lock<mutex> lock(m_mutex);
auto ts = time_point_unlocked(new_count_absolute);
auto delta_dur = ts - chrono::high_resolution_clock::now();
return max(min(chrono::duration<double>(delta_dur).count(), m_max_delay), m_min_delay);
}
}
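
The refactored `RateLimiter::sleep_time` borrows the cost first and then converts any resulting token debt into a delay of `-m_tokens / m_rate` seconds. The arithmetic can be checked with a small sketch that uses an explicit clock instead of a timer. `TokenBucket` and `advance` are hypothetical names, and the locking in the real class is omitted.

```cpp
#include <algorithm>

// Hypothetical token bucket with an explicit clock, for illustration only.
struct TokenBucket {
    double rate;   // tokens added per second, must be > 0
    double burst;  // maximum number of tokens that can accumulate
    double tokens; // current balance; negative means "in debt"

    // Account for `seconds` of elapsed time, capped at the burst size.
    void advance(double seconds)
    {
        tokens = std::min(burst, tokens + seconds * rate);
    }

    // Spend `cost` tokens and report how long the caller should wait
    // for the balance to return to zero.
    double sleep_time(double cost)
    {
        tokens -= cost;
        return tokens >= 0 ? 0.0 : -tokens / rate;
    }
};
```

With a rate of 10 tokens per second and an empty bucket, a cost of 5 yields a half-second wait, after which the balance is back at zero.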

11
lib/uname.cc Normal file

@@ -0,0 +1,11 @@
#include "crucible/error.h"
#include "crucible/uname.h"
namespace crucible {
using namespace std;
Uname::Uname()
{
DIE_IF_NON_ZERO(uname(static_cast<utsname*>(this)));
}
}


@@ -1,16 +0,0 @@
#include "crucible/uuid.h"
namespace crucible {
using namespace std;
const size_t uuid_unparsed_size = 37; // "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\0"
string
uuid_unparse(const unsigned char in[16])
{
char out[uuid_unparsed_size];
::uuid_unparse(in, out);
return string(out);
}
}


@@ -1,4 +1,13 @@
CCFLAGS = -Wall -Wextra -Werror -O3 -I../include -ggdb -fpic
# CCFLAGS = -Wall -Wextra -Werror -O0 -I../include -ggdb -fpic
CFLAGS = $(CCFLAGS) -std=c99
CXXFLAGS = $(CCFLAGS) -std=c++11 -Wold-style-cast
# Default:
CCFLAGS = -Wall -Wextra -Werror -O3
# Optimized:
# CCFLAGS = -Wall -Wextra -Werror -O3 -march=native
# Debug:
# CCFLAGS = -Wall -Wextra -Werror -O0 -ggdb
CCFLAGS += -I../include -D_FILE_OFFSET_BITS=64
BEES_CFLAGS = $(CCFLAGS) -std=c99 $(CFLAGS)
BEES_CXXFLAGS = $(CCFLAGS) -std=c++11 -Wold-style-cast -Wno-missing-field-initializers $(CXXFLAGS)

34
scripts/beesd.conf.sample Normal file

@@ -0,0 +1,34 @@
## Config for Bees: /etc/bees/beesd.conf.sample
## https://github.com/Zygo/bees
## These are the default values; change them if needed
# How to use?
# Copy this file to a new file name and adjust the UUID below
# Which FS will be used
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
## System Vars
# Change carefully
# WORK_DIR=/run/bees/
# MNT_DIR="$WORK_DIR/mnt/$UUID"
# BEESHOME="$MNT_DIR/.beeshome"
# BEESSTATUS="$WORK_DIR/$UUID.status"
## Options to apply, see `beesd --help` for details
# OPTIONS="--strip-paths --no-timestamps"
## Bees DB size
# Hash Table Sizing
# Hash table entries are 16 bytes each
# (64-bit hash, 52-bit block number, and some metadata bits)
# Each entry represents a minimum of 4K on disk.
# unique data size    hash table size    average dedupe block size
#     1TB                 4GB                  4K
#     1TB                 1GB                 16K
#     1TB               256MB                 64K
#     1TB                16MB               1024K
#    64TB                 1GB               1024K
#
# Size MUST be a multiple of 128KB
# DB_SIZE=$((1024*1024*1024)) # 1G in bytes
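
The sizing table in the comments above follows from the 16-byte entry size: the average dedupe block size is the unique data size divided by the number of entries the table can hold. A small sketch of that arithmetic follows. The function and constant names are hypothetical, and sizes are treated as binary units.

```cpp
#include <cstdint>

constexpr uint64_t KiB = 1024;
constexpr uint64_t MiB = 1024 * KiB;
constexpr uint64_t GiB = 1024 * MiB;
constexpr uint64_t TiB = 1024 * GiB;

constexpr uint64_t kEntryBytes = 16;       // one hash table entry
constexpr uint64_t kAlignment = 128 * KiB; // DB_SIZE granularity

// Average amount of unique data that each hash table entry has to cover.
constexpr uint64_t average_block_size(uint64_t unique_data_bytes, uint64_t table_bytes)
{
    return unique_data_bytes / (table_bytes / kEntryBytes);
}

// Mirrors the wrapper script's check that DB_SIZE is a multiple of 128K.
constexpr bool valid_db_size(uint64_t table_bytes)
{
    return table_bytes % kAlignment == 0;
}
```

For example, 1TB of unique data with a 1GB table gives 64M entries, so each entry covers 16K on average, matching the second row of the table.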

146
scripts/beesd.in Executable file

@@ -0,0 +1,146 @@
#!/bin/bash
# if not called from systemd try to replicate mount unsharing on ctrl+c
# see: https://github.com/Zygo/bees/issues/281
if [ -z "${SYSTEMD_EXEC_PID}" -a -z "${UNSHARE_DONE}" ]; then
UNSHARE_DONE=true
export UNSHARE_DONE
exec unshare -m --propagation private -- "$0" "$@"
fi
## Helpful functions
INFO(){ echo "INFO:" "$@"; }
ERRO(){ echo "ERROR:" "$@"; exit 1; }
YN(){ [[ "$1" =~ (1|Y|y) ]]; }
## Global vars
export BEESHOME BEESSTATUS
export WORK_DIR CONFIG_DIR
export CONFIG_FILE
export UUID AL16M AL128K
readonly AL128K="$((128*1024))"
readonly AL16M="$((16*1024*1024))"
readonly CONFIG_DIR=@ETC_PREFIX@/bees/
readonly bees_bin=$(realpath @DESTDIR@/@LIBEXEC_PREFIX@/bees)
command -v "$bees_bin" &> /dev/null || ERRO "Missing 'bees' agent"
uuid_valid(){
if uuidparse -n -o VARIANT $1 | grep -i -q invalid; then
false
fi
}
help(){
echo "Usage: beesd [options] <btrfs_uuid>"
echo "- - -"
exec "$bees_bin" --help
}
for i in $("$bees_bin" --help 2>&1 | grep -E " --" | sed -e "s/^[^-]*-/-/" -e "s/,[^-]*--/ --/" -e "s/ [^-]*$//")
do
TMP_ARGS="$TMP_ARGS $i"
done
IFS=" " read -r -a SUPPORTED_ARGS <<< $TMP_ARGS
NOT_SUPPORTED_ARGS=()
ARGUMENTS=()
for arg in "${@}"; do
supp=false
for supp_arg in "${SUPPORTED_ARGS[@]}"; do
if [[ "$arg" == ${supp_arg}* ]]; then
supp=true
break
fi
done
if $supp; then
ARGUMENTS+=($arg)
else
NOT_SUPPORTED_ARGS+=($arg)
fi
done
for arg in "${ARGUMENTS[@]}"; do
case $arg in
-h) help;;
--help) help;;
esac
done
for arg in "${NOT_SUPPORTED_ARGS[@]}"; do
if uuid_valid $arg; then
[ ! -z "$UUID" ] && help
UUID=$arg
fi
done
[ -z "$UUID" ] && help
FILE_CONFIG="$(grep -E -l '^[^#]*UUID\s*=\s*"?'"$UUID" "$CONFIG_DIR"/*.conf | head -1)"
[ ! -f "$FILE_CONFIG" ] && ERRO "No config for $UUID"
INFO "Find $UUID in $FILE_CONFIG, use as conf"
source "$FILE_CONFIG"
## Pre checks
{
[ ! -d "$CONFIG_DIR" ] && ERRO "Missing: $CONFIG_DIR"
[ "$UID" == "0" ] || ERRO "Must be run as root"
}
WORK_DIR="${WORK_DIR:-/run/bees/}"
MNT_DIR="${MNT_DIR:-$WORK_DIR/mnt/$UUID}"
BEESHOME="${BEESHOME:-$MNT_DIR/.beeshome}"
BEESSTATUS="${BEESSTATUS:-$WORK_DIR/$UUID.status}"
DB_SIZE="${DB_SIZE:-$((8192*AL128K))}"
INFO "Check: Disk exists"
if [ ! -b "/dev/disk/by-uuid/$UUID" ]; then
ERRO "Missing disk: /dev/disk/by-uuid/$UUID"
fi
is_btrfs(){ [ "$(blkid -s TYPE -o value "$1")" == "btrfs" ]; }
INFO "Check: Disk with btrfs"
if ! is_btrfs "/dev/disk/by-uuid/$UUID"; then
ERRO "Disk not contain btrfs: /dev/disk/by-uuid/$UUID"
fi
INFO "WORK DIR: $WORK_DIR"
mkdir -p "$WORK_DIR" || exit 1
INFO "MOUNT DIR: $MNT_DIR"
mkdir -p "$MNT_DIR" || exit 1
mount --make-private -osubvolid=5,nodev,noexec /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
if [ ! -d "$BEESHOME" ]; then
INFO "Create subvol $BEESHOME for store bees data"
btrfs sub cre "$BEESHOME"
fi
# Check DB size
{
DB_PATH="$BEESHOME/beeshash.dat"
touch "$DB_PATH"
OLD_SIZE="$(du -b "$DB_PATH" | sed 's/\t/ /g' | cut -d' ' -f1)"
NEW_SIZE="$DB_SIZE"
if (( "$NEW_SIZE"%AL128K > 0 )); then
ERRO "DB_SIZE Must be multiple of 128K"
fi
if (( "$OLD_SIZE" != "$NEW_SIZE" )); then
INFO "Resize db: $OLD_SIZE -> $NEW_SIZE"
rm -f "$BEESHOME/beescrawl.dat"
truncate -s $NEW_SIZE $DB_PATH
fi
chmod 700 "$DB_PATH"
}
MNT_DIR="$(realpath $MNT_DIR)"
cd "$MNT_DIR"
exec "$bees_bin" "${ARGUMENTS[@]}" $OPTIONS "$MNT_DIR"

60
scripts/beesd@.service.in Normal file

@@ -0,0 +1,60 @@
[Unit]
Description=Bees (%i)
Documentation=https://github.com/Zygo/bees
After=sysinit.target
[Service]
Type=simple
ExecStart=@PREFIX@/sbin/beesd --no-timestamps %i
CPUAccounting=true
CPUSchedulingPolicy=batch
CPUWeight=12
IOSchedulingClass=idle
IOSchedulingPriority=7
IOWeight=10
KillMode=control-group
KillSignal=SIGTERM
MemoryAccounting=true
Nice=19
Restart=on-abnormal
RuntimeDirectoryMode=0700
RuntimeDirectory=bees
StartupCPUWeight=25
StartupIOWeight=25
# Hide other users' processes in /proc/
ProtectProc=invisible
# Mount / as read-only
ProtectSystem=strict
# Forbid access to /home, /root and /run/user
ProtectHome=true
# Mount tmpfs on /tmp/ and /var/tmp/.
# Cannot mount at /run/ or /var/run/ because they are used by systemd.
PrivateTmp=true
# Disable network access
PrivateNetwork=true
# Use private IPC and UTS namespaces
PrivateIPC=true
ProtectHostname=true
# Disable write access to kernel variables through /proc
ProtectKernelTunables=true
# Disable access to control groups
ProtectControlGroups=true
# Set capabilities of the new program
# The first three are required for accessing any file on the mounted filesystem.
# The last one is required for mounting the filesystem.
AmbientCapabilities=CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_FOWNER CAP_SYS_ADMIN
# With NoNewPrivileges, running sudo cannot gain any new privilege
NoNewPrivileges=true
[Install]
WantedBy=basic.target

3
src/.gitignore vendored Normal file

@@ -0,0 +1,3 @@
*.new.c
bees-usage.c
bees-version.[ch]


@@ -1,27 +1,12 @@
PROGRAMS = \
../bin/bees \
../bin/fiemap \
../bin/fiewalk \
BEES = ../bin/bees
all: $(PROGRAMS) depends.mk
all: $(BEES)
include ../makeflags
-include ../localconf
LIBS = -lcrucible -lpthread
LDFLAGS = -L../lib -Wl,-rpath=$(shell realpath ../lib)
depends.mk: Makefile *.cc
for x in *.cc; do $(CXX) $(CXXFLAGS) -M "$$x"; done > depends.mk.new
mv -fv depends.mk.new depends.mk
-include depends.mk
%.o: %.cc %.h
$(CXX) $(CXXFLAGS) -o "$@" -c "$<"
../bin/%: %.o
@echo Implicit bin rule "$<" '->' "$@"
$(CXX) $(CXXFLAGS) -o "$@" "$<" $(LDFLAGS) $(LIBS)
BEES_LDFLAGS = -L../lib $(LDFLAGS)
BEES_OBJS = \
bees.o \
@@ -30,10 +15,32 @@ BEES_OBJS = \
bees-resolve.o \
bees-roots.o \
bees-thread.o \
bees-trace.o \
bees-types.o \
../bin/bees: $(BEES_OBJS)
$(CXX) $(CXXFLAGS) -o "$@" $(BEES_OBJS) $(LDFLAGS) $(LIBS)
ALL_OBJS = $(BEES_OBJS) $(PROGRAM_OBJS)
bees-version.c: bees.h $(BEES_OBJS:.o=.cc) Makefile ../lib/libcrucible.a
echo "const char *BEES_VERSION = \"$(BEES_VERSION)\";" > bees-version.c.new
if ! [ -e "$@" ] || ! cmp -s "$@.new" "$@"; then mv -fv $@.new $@; fi
bees-usage.c: bees-usage.txt Makefile
(echo 'const char *BEES_USAGE = '; sed -r 's/^(.*)$$/"\1\\n"/' < bees-usage.txt; echo ';') > bees-usage.new.c
mv -f bees-usage.new.c bees-usage.c
%.dep: %.cc Makefile
$(CXX) $(BEES_CXXFLAGS) -M -MF $@ -MT $(<:.cc=.o) $<
include $(ALL_OBJS:%.o=%.dep)
%.o: %.c ../makeflags
$(CC) $(BEES_CFLAGS) -o $@ -c $<
%.o: %.cc ../makeflags
$(CXX) $(BEES_CXXFLAGS) -o $@ -c $<
$(BEES): $(BEES_OBJS) bees-version.o bees-usage.o ../lib/libcrucible.a
$(CXX) $(BEES_CXXFLAGS) $(BEES_LDFLAGS) -o $@ $^ $(LIBS)
clean:
-rm -fv *.o
rm -fv *.o bees-version.c

File diff suppressed because it is too large


@@ -1,21 +1,21 @@
#include "bees.h"
#include "crucible/city.h"
#include "crucible/crc64.h"
#include "crucible/string.h"
#include "crucible/uname.h"
#include <algorithm>
#include <random>
#include <sys/mman.h>
using namespace crucible;
using namespace std;
static inline
bool
using_any_madvise()
BeesHash::BeesHash(const uint8_t *ptr, size_t len) :
// m_hash(CityHash64(reinterpret_cast<const char *>(ptr), len))
m_hash(Digest::CRC::crc64(ptr, len))
{
return true;
}
ostream &
@@ -31,16 +31,18 @@ operator<<(ostream &os, const BeesHashTable::Cell &bhte)
<< BeesAddress(bhte.e_addr) << " }";
}
#if 0
static
void
dump_bucket(BeesHashTable::Cell *p, BeesHashTable::Cell *q)
dump_bucket_locked(BeesHashTable::Cell *p, BeesHashTable::Cell *q)
{
// Must be called while holding m_bucket_mutex
for (auto i = p; i < q; ++i) {
BEESLOG("Entry " << i - p << " " << *i);
}
}
#endif
const bool VERIFY_CLEARS_BUGS = false;
static const bool VERIFY_CLEARS_BUGS = false;
bool
verify_cell_range(BeesHashTable::Cell *p, BeesHashTable::Cell *q, bool clear_bugs = VERIFY_CLEARS_BUGS)
@@ -51,7 +53,7 @@ verify_cell_range(BeesHashTable::Cell *p, BeesHashTable::Cell *q, bool clear_bug
for (BeesHashTable::Cell *cell = p; cell < q; ++cell) {
if (cell->e_addr && cell->e_addr < 0x1000) {
BEESCOUNT(bug_hash_magic_addr);
BEESINFO("Bad hash table address hash " << to_hex(cell->e_hash) << " addr " << to_hex(cell->e_addr));
BEESLOGDEBUG("Bad hash table address hash " << to_hex(cell->e_hash) << " addr " << to_hex(cell->e_addr));
if (clear_bugs) {
cell->e_addr = 0;
cell->e_hash = 0;
@@ -60,8 +62,8 @@ verify_cell_range(BeesHashTable::Cell *p, BeesHashTable::Cell *q, bool clear_bug
}
if (cell->e_addr && !seen_it.insert(*cell).second) {
BEESCOUNT(bug_hash_duplicate_cell);
// BEESLOG("Duplicate hash table entry:\nthis = " << *cell << "\nold = " << *seen_it.find(*cell));
BEESINFO("Duplicate hash table entry: " << *cell);
// BEESLOGDEBUG("Duplicate hash table entry:\nthis = " << *cell << "\nold = " << *seen_it.find(*cell));
BEESLOGDEBUG("Duplicate hash table entry: " << *cell);
if (clear_bugs) {
cell->e_addr = 0;
cell->e_hash = 0;
@@ -98,69 +100,132 @@ BeesHashTable::get_extent_range(HashType hash)
return make_pair(bp, ep);
}
void
BeesHashTable::flush_dirty_extents()
bool
BeesHashTable::flush_dirty_extent(uint64_t extent_index)
{
if (using_shared_map()) return;
BEESNOTE("flushing extent #" << extent_index << " of " << m_extents << " extents");
auto lock = lock_extent_by_index(extent_index);
bool wrote_extent = false;
catch_all([&]() {
uint8_t *const dirty_extent = m_extent_ptr[extent_index].p_byte;
uint8_t *const dirty_extent_end = m_extent_ptr[extent_index + 1].p_byte;
const size_t dirty_extent_offset = dirty_extent - m_byte_ptr;
THROW_CHECK1(out_of_range, dirty_extent, dirty_extent >= m_byte_ptr);
THROW_CHECK1(out_of_range, dirty_extent_end, dirty_extent_end <= m_byte_ptr_end);
THROW_CHECK2(out_of_range, dirty_extent_end, dirty_extent, dirty_extent_end - dirty_extent == BLOCK_SIZE_HASHTAB_EXTENT);
BEESTOOLONG("pwrite(fd " << m_fd << " '" << name_fd(m_fd)<< "', length " << to_hex(dirty_extent_end - dirty_extent) << ", offset " << to_hex(dirty_extent - m_byte_ptr) << ")");
// Copy the extent because we might be stuck writing for a while
ByteVector extent_copy(dirty_extent, dirty_extent_end);
// Release the lock
lock.unlock();
// Write the extent (or not)
pwrite_or_die(m_fd, extent_copy, dirty_extent_offset);
BEESCOUNT(hash_extent_out);
// Nope, this causes a _dramatic_ loss of performance.
// const size_t dirty_extent_size = dirty_extent_end - dirty_extent;
// bees_unreadahead(m_fd, dirty_extent_offset, dirty_extent_size);
// Mark extent clean if write was successful
lock.lock();
m_extent_metadata.at(extent_index).m_dirty = false;
wrote_extent = true;
});
return wrote_extent;
}
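The new flush_dirty_extent() copies the extent while holding its lock, drops the lock for the slow write, and retakes the lock only to clear the dirty flag. A minimal standalone sketch of that pattern follows. ExtentSketch, flush_extent_sketch and the slow_write callback are hypothetical names invented for illustration (the callback stands in for pwrite_or_die), and the dirty check inside the function is a simplification: in the diff the caller performs that check.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one hash table extent: a byte buffer, a dirty
// flag, and the mutex that protects both.
struct ExtentSketch {
    std::mutex mtx;
    std::vector<uint8_t> bytes;
    bool dirty = false;
};

// Snapshot the extent under its lock, write the snapshot with the lock
// released, then retake the lock to mark the extent clean.
// slow_write is an assumed callback; it returns false if the write failed.
bool flush_extent_sketch(ExtentSketch &ext,
                         const std::function<bool(const std::vector<uint8_t> &)> &slow_write)
{
    std::unique_lock<std::mutex> lock(ext.mtx);
    if (!ext.dirty) {
        return false;  // nothing to write
    }
    // Copy first: other threads can use the extent while we are stuck in I/O
    std::vector<uint8_t> snapshot(ext.bytes);
    lock.unlock();
    if (!slow_write(snapshot)) {
        return false;  // extent stays dirty, so a later pass can retry it
    }
    // Mark clean only after the write succeeded
    lock.lock();
    ext.dirty = false;
    return true;
}
```

As in the diff, an update that lands while the write is in flight has its dirty mark cleared by the final step; the sketch keeps that property rather than fixing it.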
size_t
BeesHashTable::flush_dirty_extents(bool slowly)
{
THROW_CHECK1(runtime_error, m_buckets, m_buckets > 0);
unique_lock<mutex> lock(m_extent_mutex);
auto dirty_extent_copy = m_buckets_dirty;
m_buckets_dirty.clear();
if (dirty_extent_copy.empty()) {
BEESNOTE("idle");
m_condvar.wait(lock);
return; // please call later, i.e. immediately
}
lock.unlock();
uint64_t wrote_extents = 0;
for (size_t extent_index = 0; extent_index < m_extents; ++extent_index) {
// Skip the clean ones
auto lock = lock_extent_by_index(extent_index);
if (!m_extent_metadata.at(extent_index).m_dirty) {
continue;
}
lock.unlock();
size_t extent_counter = 0;
for (auto extent_number : dirty_extent_copy) {
++extent_counter;
BEESNOTE("flush extent #" << extent_number << " (" << extent_counter << " of " << dirty_extent_copy.size() << ")");
catch_all([&]() {
uint8_t *dirty_extent = m_extent_ptr[extent_number].p_byte;
uint8_t *dirty_extent_end = m_extent_ptr[extent_number + 1].p_byte;
THROW_CHECK1(out_of_range, dirty_extent, dirty_extent >= m_byte_ptr);
THROW_CHECK1(out_of_range, dirty_extent_end, dirty_extent_end <= m_byte_ptr_end);
if (using_shared_map()) {
BEESTOOLONG("flush extent " << extent_number);
copy(dirty_extent, dirty_extent_end, dirty_extent);
} else {
BEESTOOLONG("pwrite(fd " << m_fd << " '" << name_fd(m_fd)<< "', length " << to_hex(dirty_extent_end - dirty_extent) << ", offset " << to_hex(dirty_extent - m_byte_ptr) << ")");
// Page locks slow us down more than copying the data does
vector<uint8_t> extent_copy(dirty_extent, dirty_extent_end);
pwrite_or_die(m_fd, extent_copy, dirty_extent - m_byte_ptr);
BEESCOUNT(hash_extent_out);
if (flush_dirty_extent(extent_index)) {
++wrote_extents;
if (slowly) {
if (m_stop_requested) {
slowly = false;
continue;
}
BEESNOTE("flush rate limited after extent #" << extent_index << " of " << m_extents << " extents");
chrono::duration<double> sleep_time(m_flush_rate_limit.sleep_time(BLOCK_SIZE_HASHTAB_EXTENT));
unique_lock<mutex> lock(m_stop_mutex);
m_stop_condvar.wait_for(lock, sleep_time);
}
});
BEESNOTE("flush rate limited at extent #" << extent_number << " (" << extent_counter << " of " << dirty_extent_copy.size() << ")");
m_flush_rate_limit.sleep_for(BLOCK_SIZE_HASHTAB_EXTENT);
}
}
BEESLOGINFO("Flushed " << wrote_extents << " of " << m_extents << " hash table extents");
return wrote_extents;
}
void
BeesHashTable::set_extent_dirty(HashType hash)
BeesHashTable::set_extent_dirty_locked(uint64_t extent_index)
{
if (using_shared_map()) return;
THROW_CHECK1(runtime_error, m_buckets, m_buckets > 0);
auto pr = get_extent_range(hash);
uint64_t extent_number = reinterpret_cast<Extent *>(pr.first) - m_extent_ptr;
THROW_CHECK1(runtime_error, extent_number, extent_number < m_extents);
unique_lock<mutex> lock(m_extent_mutex);
m_buckets_dirty.insert(extent_number);
m_condvar.notify_one();
// Must already be locked
m_extent_metadata.at(extent_index).m_dirty = true;
// Signal writeback thread
unique_lock<mutex> dirty_lock(m_dirty_mutex);
m_dirty = true;
m_dirty_condvar.notify_one();
}
void
BeesHashTable::writeback_loop()
{
if (!using_shared_map()) {
while (1) {
flush_dirty_extents();
while (!m_stop_requested) {
auto wrote_extents = flush_dirty_extents(true);
BEESNOTE("idle after writing " << wrote_extents << " of " << m_extents << " extents");
unique_lock<mutex> lock(m_dirty_mutex);
if (m_stop_requested) {
break;
}
if (m_dirty) {
m_dirty = false;
} else {
m_dirty_condvar.wait(lock);
}
}
// The normal loop exits at the end of one iteration when stop is requested,
// but the stop request can arrive in the middle of a flush pass, leaving
// some extents dirty.  Run the flush loop again to get those.
BEESNOTE("flushing hash table, round 2");
BEESLOGDEBUG("Flushing hash table");
flush_dirty_extents(false);
// If there were any Tasks still running, they may have updated
// some hash table pages during the second flush. These updates
// will be lost. The Tasks will be repeated on the next run because
// they were not completed prior to the stop request, and the
// Crawl progress was already flushed out before the Hash table
// started writing, so nothing is really lost here.
catch_all([&]() {
// trigger writeback on our way out
#if 0
// seems to trigger huge latency spikes
BEESTOOLONG("unreadahead hash table size " <<
pretty(m_size)); bees_unreadahead(m_fd, 0, m_size);
#endif
});
BEESLOGDEBUG("Exited hash table writeback_loop");
}
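The rewritten writeback_loop() flushes, then sleeps on a condition variable until either a dirty signal or a stop request arrives, and runs one last flush on the way out. The sketch below reduces that control flow to a standalone class. WritebackSketch and its members are hypothetical names, a single mutex replaces the separate m_dirty_mutex and m_stop_mutex of the diff, and flush() only counts calls instead of writing anything.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical reduction of the writeback thread's control flow.
class WritebackSketch {
public:
    // Called by writers: record pending work and wake the worker.
    void set_dirty() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_dirty = true;
        m_condvar.notify_one();
    }

    // Called at shutdown: set the stop flag and wake the worker.
    void stop_request() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_stop = true;
        m_condvar.notify_all();
    }

    // Worker body: flush, then wait for more work or a stop request.
    void run() {
        for (;;) {
            flush();
            std::unique_lock<std::mutex> lock(m_mutex);
            if (m_stop) {
                break;
            }
            if (m_dirty) {
                m_dirty = false;           // consume the signal, flush again
            } else {
                m_condvar.wait(lock);      // a spurious wakeup only costs a flush
            }
        }
        // Stop may have arrived mid-pass, so run one more pass for stragglers
        flush();
    }

    int flush_count() const { return m_flushes.load(); }

private:
    void flush() { ++m_flushes; }          // placeholder for the real writes

    std::mutex m_mutex;
    std::condition_variable m_condvar;
    bool m_dirty = false;
    bool m_stop = false;
    std::atomic<int> m_flushes{0};
};
```

In real use run() would execute on its own thread while other threads call set_dirty() and stop_request(); calling stop_request() before run() on one thread is enough to see the loop-plus-final-pass behaviour.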
static
@@ -177,14 +242,9 @@ percent(size_t num, size_t den)
void
BeesHashTable::prefetch_loop()
{
// Always do the mlock, whether shared or not
THROW_CHECK1(runtime_error, m_size, m_size > 0);
catch_all([&]() {
BEESNOTE("mlock " << pretty(m_size));
DIE_IF_NON_ZERO(mlock(m_byte_ptr, m_size));
});
while (1) {
Uname uname;
bool not_locked = true;
while (!m_stop_requested) {
size_t width = 64;
vector<size_t> occupancy(width, 0);
size_t occupied_count = 0;
@@ -194,14 +254,15 @@ BeesHashTable::prefetch_loop()
size_t toxic_count = 0;
size_t unaligned_eof_count = 0;
for (uint64_t ext = 0; ext < m_extents; ++ext) {
BEESNOTE("prefetching hash table extent " << ext << " of " << m_extent_ptr_end - m_extent_ptr);
m_prefetch_running = true;
for (uint64_t ext = 0; ext < m_extents && !m_stop_requested; ++ext) {
BEESNOTE("prefetching hash table extent #" << ext << " of " << m_extents);
catch_all([&]() {
fetch_missing_extent(ext * c_buckets_per_extent);
fetch_missing_extent_by_index(ext);
BEESNOTE("analyzing hash table extent " << ext << " of " << m_extent_ptr_end - m_extent_ptr);
BEESNOTE("analyzing hash table extent #" << ext << " of " << m_extents);
bool duplicate_bugs_found = false;
unique_lock<mutex> lock(m_bucket_mutex);
auto lock = lock_extent_by_index(ext);
for (Bucket *bucket = m_extent_ptr[ext].p_buckets; bucket < m_extent_ptr[ext + 1].p_buckets; ++bucket) {
if (verify_cell_range(bucket[0].p_cells, bucket[1].p_cells)) {
duplicate_bugs_found = true;
@@ -230,12 +291,12 @@ BeesHashTable::prefetch_loop()
// Count these instead of calculating the number so we get better stats in case of exceptions
occupied_count += this_bucket_occupied_count;
}
lock.unlock();
if (duplicate_bugs_found) {
set_extent_dirty(ext);
set_extent_dirty_locked(ext);
}
});
}
m_prefetch_running = false;
BEESNOTE("calculating hash table statistics");
@@ -268,20 +329,19 @@ BeesHashTable::prefetch_loop()
out << "\n";
}
size_t uncompressed_count = occupied_count - compressed_count;
size_t legacy_count = compressed_count - compressed_offset_count;
size_t uncompressed_count = occupied_count - compressed_offset_count;
ostringstream graph_blob;
graph_blob << "Now: " << format_time(time(NULL)) << "\n";
graph_blob << "Uptime: " << m_ctx->total_timer().age() << " seconds\n";
graph_blob << "Version: " << BEES_VERSION << "\n";
graph_blob << "Kernel: " << uname.sysname << " " << uname.release << " " << uname.machine << " " << uname.version << "\n";
graph_blob
<< "\nHash table page occupancy histogram (" << occupied_count << "/" << total_count << " cells occupied, " << (occupied_count * 100 / total_count) << "%)\n"
graph_blob
<< "\nHash table page occupancy histogram (" << occupied_count << "/" << total_count << " cells occupied, " << (occupied_count * 100 / total_count) << "%)\n"
<< out.str() << "0% | 25% | 50% | 75% | 100% page fill\n"
<< "compressed " << compressed_count << " (" << percent(compressed_count, occupied_count) << ")"
<< " new-style " << compressed_offset_count << " (" << percent(compressed_offset_count, occupied_count) << ")"
<< " old-style " << legacy_count << " (" << percent(legacy_count, occupied_count) << ")\n"
<< "compressed " << compressed_count << " (" << percent(compressed_count, occupied_count) << ")\n"
<< "uncompressed " << uncompressed_count << " (" << percent(uncompressed_count, occupied_count) << ")"
<< " unaligned_eof " << unaligned_eof_count << " (" << percent(unaligned_eof_count, occupied_count) << ")"
<< " toxic " << toxic_count << " (" << percent(toxic_count, occupied_count) << ")";
@@ -296,91 +356,148 @@ BeesHashTable::prefetch_loop()
auto avg_rates = thisStats / m_ctx->total_timer().age();
graph_blob << "\t" << avg_rates << "\n";
BEESLOG(graph_blob.str());
graph_blob << m_ctx->get_progress();
BEESLOGINFO(graph_blob.str());
catch_all([&]() {
m_stats_file.write(graph_blob.str());
});
if (not_locked && !m_stop_requested) {
// Always do the mlock, whether shared or not
THROW_CHECK1(runtime_error, m_size, m_size > 0);
BEESLOGINFO("mlock(" << pretty(m_size) << ")...");
Timer lock_time;
catch_all([&]() {
BEESNOTE("mlock " << pretty(m_size));
DIE_IF_NON_ZERO(mlock(m_byte_ptr, m_size));
});
BEESLOGINFO("mlock(" << pretty(m_size) << ") done in " << lock_time << " sec");
not_locked = false;
}
BEESNOTE("idle " << BEES_HASH_TABLE_ANALYZE_INTERVAL << "s");
nanosleep(BEES_HASH_TABLE_ANALYZE_INTERVAL);
unique_lock<mutex> lock(m_stop_mutex);
if (m_stop_requested) {
BEESLOGDEBUG("Stop requested in hash table prefetch");
return;
}
m_stop_condvar.wait_for(lock, chrono::duration<double>(BEES_HASH_TABLE_ANALYZE_INTERVAL));
}
}
size_t
BeesHashTable::hash_to_extent_index(HashType hash)
{
auto pr = get_extent_range(hash);
uint64_t extent_index = reinterpret_cast<const Extent *>(pr.first) - m_extent_ptr;
THROW_CHECK2(runtime_error, extent_index, m_extents, extent_index < m_extents);
return extent_index;
}
BeesHashTable::ExtentMetaData::ExtentMetaData() :
m_mutex_ptr(make_shared<mutex>())
{
}
unique_lock<mutex>
BeesHashTable::lock_extent_by_index(uint64_t extent_index)
{
THROW_CHECK2(out_of_range, extent_index, m_extents, extent_index < m_extents);
return unique_lock<mutex>(*m_extent_metadata.at(extent_index).m_mutex_ptr);
}
unique_lock<mutex>
BeesHashTable::lock_extent_by_hash(HashType hash)
{
BEESTOOLONG("fetch_missing_extent for hash " << to_hex(hash));
return lock_extent_by_index(hash_to_extent_index(hash));
}
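lock_extent_by_index() and lock_extent_by_hash() replace the single m_bucket_mutex with one mutex per extent, so threads working on different extents no longer contend. A standalone sketch of the idea is below. ExtentLocksSketch is a hypothetical class, and mapping a hash to an extent with a plain modulo is an assumption made for illustration: the diff derives the index from get_extent_range() instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical per-extent lock table.
class ExtentLocksSketch {
public:
    explicit ExtentLocksSketch(size_t extents) {
        // std::mutex is not movable, so hold each one through a shared_ptr
        for (size_t i = 0; i < extents; ++i) {
            m_mutexes.push_back(std::make_shared<std::mutex>());
        }
    }

    // Returns a held lock on one extent; at() throws std::out_of_range
    // if the index is not valid.
    std::unique_lock<std::mutex> lock_by_index(size_t index) {
        return std::unique_lock<std::mutex>(*m_mutexes.at(index));
    }

    // Assumed mapping: hash modulo extent count (illustration only).
    std::unique_lock<std::mutex> lock_by_hash(uint64_t hash) {
        return lock_by_index(static_cast<size_t>(hash % m_mutexes.size()));
    }

private:
    std::vector<std::shared_ptr<std::mutex>> m_mutexes;
};
```

Returning the unique_lock by value hands ownership to the caller, which can then unlock early and relock later, as the flush code in this diff does.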
void
BeesHashTable::fetch_missing_extent_by_index(uint64_t extent_index)
{
BEESNOTE("checking hash extent #" << extent_index << " of " << m_extents << " extents");
auto lock = lock_extent_by_index(extent_index);
if (!m_extent_metadata.at(extent_index).m_missing) {
return;
}
// OK we have to read this extent
BEESNOTE("fetching hash extent #" << extent_index << " of " << m_extents << " extents");
BEESTRACE("Fetching hash extent #" << extent_index << " of " << m_extents << " extents");
BEESTOOLONG("Fetching hash extent #" << extent_index << " of " << m_extents << " extents");
uint8_t *const dirty_extent = m_extent_ptr[extent_index].p_byte;
uint8_t *const dirty_extent_end = m_extent_ptr[extent_index + 1].p_byte;
const size_t dirty_extent_size = dirty_extent_end - dirty_extent;
const size_t dirty_extent_offset = dirty_extent - m_byte_ptr;
// If the read fails don't retry, just go with whatever data we have
m_extent_metadata.at(extent_index).m_missing = false;
catch_all([&]() {
BEESTOOLONG("pread(fd " << m_fd << " '" << name_fd(m_fd)<< "', length " << to_hex(dirty_extent_end - dirty_extent) << ", offset " << to_hex(dirty_extent - m_byte_ptr) << ")");
pread_or_die(m_fd, dirty_extent, dirty_extent_size, dirty_extent_offset);
// Only count extents successfully read
BEESCOUNT(hash_extent_in);
// Won't need that again
bees_unreadahead(m_fd, dirty_extent_offset, dirty_extent_size);
// If we are in prefetch, give the kernel a hint about the next extent
if (m_prefetch_running) {
// Use the kernel readahead here, because it might work for this use case
readahead(m_fd, dirty_extent_offset + dirty_extent_size, dirty_extent_size);
}
});
Cell *cell = m_extent_ptr[extent_index ].p_buckets[0].p_cells;
Cell *cell_end = m_extent_ptr[extent_index + 1].p_buckets[0].p_cells;
size_t toxic_cleared_count = 0;
set<BeesHashTable::Cell> seen_it(cell, cell_end);
while (cell < cell_end) {
if (cell->e_addr & BeesAddress::c_toxic_mask) {
++toxic_cleared_count;
cell->e_addr &= ~BeesAddress::c_toxic_mask;
// Clearing the toxic bit might mean we now have a duplicate.
// This could be due to a race between two
// inserts, one finds the extent toxic while the
// other does not. That's arguably a bug elsewhere,
// but we should rewrite the whole extent lookup/insert
// loop, not spend time fixing code that will be
// thrown out later anyway.
// If there is a cell that is identical to this one
// except for the toxic bit, then we don't need this one.
if (seen_it.count(*cell)) {
cell->e_addr = 0;
cell->e_hash = 0;
}
}
++cell;
}
if (toxic_cleared_count) {
BEESLOGDEBUG("Cleared " << toxic_cleared_count << " hashes while fetching hash table extent " << extent_index);
}
}
void
BeesHashTable::fetch_missing_extent(HashType hash)
BeesHashTable::fetch_missing_extent_by_hash(HashType hash)
{
BEESTOOLONG("fetch_missing_extent for hash " << to_hex(hash));
if (using_shared_map()) return;
THROW_CHECK1(runtime_error, m_buckets, m_buckets > 0);
auto pr = get_extent_range(hash);
uint64_t extent_number = reinterpret_cast<Extent *>(pr.first) - m_extent_ptr;
THROW_CHECK1(runtime_error, extent_number, extent_number < m_extents);
uint64_t extent_index = hash_to_extent_index(hash);
BEESNOTE("waiting to fetch hash extent #" << extent_index << " of " << m_extents << " extents");
unique_lock<mutex> lock(m_extent_mutex);
if (!m_buckets_missing.count(extent_number)) {
return;
}
size_t missing_buckets = m_buckets_missing.size();
lock.unlock();
BEESNOTE("fetch waiting for hash extent #" << extent_number << ", " << missing_buckets << " left to fetch");
// Acquire blocking lock on this extent only
LockSet<uint64_t>::Lock extent_lock(m_extent_lock_set, extent_number);
// Check missing again because someone else might have fetched this
// extent for us while we didn't hold any locks
lock.lock();
if (!m_buckets_missing.count(extent_number)) {
BEESCOUNT(hash_extent_in_twice);
return;
}
lock.unlock();
// OK we have to read this extent
BEESNOTE("fetching hash extent #" << extent_number << ", " << missing_buckets << " left to fetch");
BEESTRACE("Fetching missing hash extent " << extent_number);
uint8_t *dirty_extent = m_extent_ptr[extent_number].p_byte;
uint8_t *dirty_extent_end = m_extent_ptr[extent_number + 1].p_byte;
{
BEESTOOLONG("pread(fd " << m_fd << " '" << name_fd(m_fd)<< "', length " << to_hex(dirty_extent_end - dirty_extent) << ", offset " << to_hex(dirty_extent - m_byte_ptr) << ")");
pread_or_die(m_fd, dirty_extent, dirty_extent_end - dirty_extent, dirty_extent - m_byte_ptr);
}
BEESCOUNT(hash_extent_in);
// We don't block when fetching an extent but we do slow down the
// prefetch thread.
m_prefetch_rate_limit.borrow(BLOCK_SIZE_HASHTAB_EXTENT);
lock.lock();
m_buckets_missing.erase(extent_number);
}
bool
BeesHashTable::is_toxic_hash(BeesHashTable::HashType hash) const
{
return m_toxic_hashes.find(hash) != m_toxic_hashes.end();
fetch_missing_extent_by_index(extent_index);
}
vector<BeesHashTable::Cell>
BeesHashTable::find_cell(HashType hash)
{
// This saves a lot of time prefilling the hash table, and there's no risk of eviction
if (is_toxic_hash(hash)) {
BEESCOUNT(hash_toxic);
BeesAddress toxic_addr(0x1000);
toxic_addr.set_toxic();
Cell toxic_cell(hash, toxic_addr);
vector<Cell> rv;
rv.push_back(toxic_cell);
return rv;
}
fetch_missing_extent(hash);
fetch_missing_extent_by_hash(hash);
BEESTOOLONG("find_cell hash " << BeesHash(hash));
vector<Cell> rv;
unique_lock<mutex> lock(m_bucket_mutex);
auto lock = lock_extent_by_hash(hash);
auto er = get_cell_range(hash);
// FIXME: Weed out zero addresses in the table due to earlier bugs
copy_if(er.first, er.second, back_inserter(rv), [=](const Cell &ip) { return ip.e_hash == hash && ip.e_addr >= 0x1000; });
@@ -388,46 +505,45 @@ BeesHashTable::find_cell(HashType hash)
return rv;
}
// Move an entry to the end of the list. Used after an attempt to resolve
// an address in the hash table fails. Probably more correctly called
// push_back_hash_addr, except it never inserts. Shared hash tables
// never erase anything, since there is no way to tell if an entry is
// out of date or just belonging to the wrong filesystem.
/// Remove a hash from the table, leaving an empty space on the list
/// where the hash used to be. Used when an invalid address is found
/// because lookups on invalid addresses really hurt.
void
BeesHashTable::erase_hash_addr(HashType hash, AddrType addr)
{
// if (m_shared) return;
fetch_missing_extent(hash);
fetch_missing_extent_by_hash(hash);
BEESTOOLONG("erase hash " << to_hex(hash) << " addr " << addr);
unique_lock<mutex> lock(m_bucket_mutex);
auto lock = lock_extent_by_hash(hash);
auto er = get_cell_range(hash);
Cell mv(hash, addr);
Cell *ip = find(er.first, er.second, mv);
bool found = (ip < er.second);
if (found) {
// Lookups on invalid addresses really hurt us. Kill it with fire!
*ip = Cell(0, 0);
set_extent_dirty(hash);
set_extent_dirty_locked(hash_to_extent_index(hash));
BEESCOUNT(hash_erase);
#if 0
if (verify_cell_range(er.first, er.second)) {
BEESINFO("while erasing hash " << hash << " addr " << addr);
BEESLOGDEBUG("while erasing hash " << hash << " addr " << addr);
}
#endif
} else {
BEESCOUNT(hash_erase_miss);
}
}
// If entry is already present in list, move it to the front of the
// list without dropping any entries, and return true. If entry is not
// present in list, insert it at the front of the list, possibly dropping
// the last entry in the list, and return false. Used to move duplicate
// hash blocks to the front of the list.
/// Insert a hash entry at the head of the list. If entry is already
/// present in list, move it to the front of the list without dropping
/// any entries, and return true. If entry is not present in list,
/// insert it at the front of the list, possibly dropping the last entry
/// in the list, and return false. Used to move duplicate hash blocks
/// to the front of the list.
bool
BeesHashTable::push_front_hash_addr(HashType hash, AddrType addr)
{
fetch_missing_extent(hash);
fetch_missing_extent_by_hash(hash);
BEESTOOLONG("push_front_hash_addr hash " << BeesHash(hash) <<" addr " << BeesAddress(addr));
unique_lock<mutex> lock(m_bucket_mutex);
auto lock = lock_extent_by_hash(hash);
auto er = get_cell_range(hash);
Cell mv(hash, addr);
Cell *ip = find(er.first, er.second, mv);
@@ -445,7 +561,7 @@ BeesHashTable::push_front_hash_addr(HashType hash, AddrType addr)
auto dp = ip;
--sp;
// If we are deleting the last entry then don't copy it
if (ip == er.second) {
if (dp == er.second) {
--sp;
--dp;
BEESCOUNT(hash_evict);
@@ -457,39 +573,44 @@ BeesHashTable::push_front_hash_addr(HashType hash, AddrType addr)
// There is now a space at the front, insert there if different
if (er.first[0] != mv) {
er.first[0] = mv;
set_extent_dirty(hash);
set_extent_dirty_locked(hash_to_extent_index(hash));
BEESCOUNT(hash_front);
} else {
BEESCOUNT(hash_front_already);
}
#if 0
if (verify_cell_range(er.first, er.second)) {
BEESINFO("while push_fronting hash " << hash << " addr " << addr);
BEESLOGDEBUG("while push_fronting hash " << hash << " addr " << addr);
}
#endif
return found;
}
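The doc comment on push_front_hash_addr() describes a move-to-front list with a fixed capacity: an existing entry moves to the head and nothing is dropped, while a new entry is inserted at the head and the tail entry may be evicted. A small sketch of that behaviour follows. CellSketch, BucketSketch, kCellsPerBucket and push_front_sketch are hypothetical names, and the bucket holds only 4 cells to keep the example readable.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical cell: a (hash, address) pair.
struct CellSketch {
    uint64_t hash;
    uint64_t addr;
    bool operator==(const CellSketch &other) const {
        return hash == other.hash && addr == other.addr;
    }
};

constexpr size_t kCellsPerBucket = 4;  // illustrative size, not the real one
using BucketSketch = std::array<CellSketch, kCellsPerBucket>;

// Put `cell` at the front of the bucket. Returns true if it was already
// present (nothing is lost), false if it was inserted (the last cell is
// overwritten).
bool push_front_sketch(BucketSketch &bucket, const CellSketch &cell)
{
    auto it = std::find(bucket.begin(), bucket.end(), cell);
    const bool found = (it != bucket.end());
    if (!found) {
        it = bucket.end() - 1;  // the slot that will be evicted
    }
    // Shift everything before `it` one place toward the back, over `*it`
    std::move_backward(bucket.begin(), it, it + 1);
    bucket.front() = cell;
    return found;
}
```

Starting from an empty bucket, pushing hashes 1 to 4 and then pushing 2 again leaves the order 2, 4, 3, 1; pushing a fifth distinct cell then evicts 1 from the tail.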
// If entry is already present in list, returns true and does not
// modify list. If entry is not present in list, returns false and
// inserts at a random position in the list, possibly evicting the entry
// at the end of the list. Used to insert new unique (not-yet-duplicate)
// blocks in random order.
thread_local uniform_int_distribution<size_t> BeesHashTable::tl_distribution(0, c_cells_per_bucket - 1);
/// Insert a hash entry at some unspecified point in the list.
/// If entry is already present in list, returns true and does not
/// modify list. If entry is not present in list, returns false and
/// inserts at a random position in the list, possibly evicting the entry
/// at the end of the list. Used to insert new unique (not-yet-duplicate)
/// blocks in random order.
bool
BeesHashTable::push_random_hash_addr(HashType hash, AddrType addr)
{
fetch_missing_extent(hash);
fetch_missing_extent_by_hash(hash);
BEESTOOLONG("push_random_hash_addr hash " << BeesHash(hash) << " addr " << BeesAddress(addr));
unique_lock<mutex> lock(m_bucket_mutex);
auto lock = lock_extent_by_hash(hash);
auto er = get_cell_range(hash);
Cell mv(hash, addr);
Cell *ip = find(er.first, er.second, mv);
bool found = (ip < er.second);
thread_local default_random_engine generator;
thread_local uniform_int_distribution<int> distribution(0, c_cells_per_bucket - 1);
auto pos = distribution(generator);
const auto pos = tl_distribution(bees_generator);
int case_cond = 0;
#if 0
vector<Cell> saved(er.first, er.second);
#endif
if (found) {
// If hash already exists after pos, swap with pos
@@ -535,20 +656,25 @@ BeesHashTable::push_random_hash_addr(HashType hash, AddrType addr)
}
// Evict something and insert at pos
move_backward(er.first + pos, er.second - 1, er.second);
// move_backward(er.first + pos, er.second - 1, er.second);
ip = er.second - 1;
while (ip > er.first + pos) {
auto dp = ip;
*dp = *--ip;
}
er.first[pos] = mv;
BEESCOUNT(hash_evict);
case_cond = 5;
ret_dirty:
BEESCOUNT(hash_insert);
set_extent_dirty(hash);
set_extent_dirty_locked(hash_to_extent_index(hash));
ret:
#if 0
if (verify_cell_range(er.first, er.second, false)) {
BEESLOG("while push_randoming (case " << case_cond << ") pos " << pos
<< " ip " << (ip - er.first) << " " << mv);
// dump_bucket(saved.data(), saved.data() + saved.size());
// dump_bucket(er.first, er.second);
// dump_bucket_locked(saved.data(), saved.data() + saved.size());
// dump_bucket_locked(er.first, er.second);
}
#else
(void)case_cond;
@@ -563,9 +689,9 @@ BeesHashTable::try_mmap_flags(int flags)
THROW_CHECK1(out_of_range, m_size, m_size > 0);
Timer map_time;
catch_all([&]() {
BEESLOG("mapping hash table size " << m_size << " with flags " << mmap_flags_ntoa(flags));
BEESLOGINFO("mapping hash table size " << m_size << " with flags " << mmap_flags_ntoa(flags));
void *ptr = mmap_or_die(nullptr, m_size, PROT_READ | PROT_WRITE, flags, flags & MAP_ANONYMOUS ? -1 : int(m_fd), 0);
BEESLOG("mmap done in " << map_time << " sec");
BEESLOGINFO("mmap done in " << map_time << " sec");
m_cell_ptr = static_cast<Cell *>(ptr);
void *ptr_end = static_cast<uint8_t *>(ptr) + m_size;
m_cell_ptr_end = static_cast<Cell *>(ptr_end);
@@ -574,12 +700,39 @@ BeesHashTable::try_mmap_flags(int flags)
}
void
BeesHashTable::set_shared(bool shared)
BeesHashTable::open_file()
{
m_shared = shared;
// OK open hash table
BEESNOTE("opening hash table '" << m_filename << "' target size " << m_size << " (" << pretty(m_size) << ")");
// Try to open existing hash table
Fd new_fd = openat(m_ctx->home_fd(), m_filename.c_str(), FLAGS_OPEN_FILE_RW, 0700);
// If that doesn't work, try to make a new one
if (!new_fd) {
string tmp_filename = m_filename + ".tmp";
BEESNOTE("creating new hash table '" << tmp_filename << "'");
BEESLOGINFO("Creating new hash table '" << tmp_filename << "'");
unlinkat(m_ctx->home_fd(), tmp_filename.c_str(), 0);
new_fd = openat_or_die(m_ctx->home_fd(), tmp_filename, FLAGS_CREATE_FILE, 0700);
BEESNOTE("truncating new hash table '" << tmp_filename << "' size " << m_size << " (" << pretty(m_size) << ")");
BEESLOGINFO("Truncating new hash table '" << tmp_filename << "' size " << m_size << " (" << pretty(m_size) << ")");
ftruncate_or_die(new_fd, m_size);
BEESNOTE("renaming new hash table '" << tmp_filename << "' -> '" << m_filename << "'");
BEESLOGINFO("Renaming new hash table '" << tmp_filename << "' -> '" << m_filename << "'");
renameat_or_die(m_ctx->home_fd(), tmp_filename, m_ctx->home_fd(), m_filename);
}
Stat st(new_fd);
off_t new_size = st.st_size;
THROW_CHECK1(invalid_argument, new_size, new_size > 0);
THROW_CHECK1(invalid_argument, new_size, (new_size % BLOCK_SIZE_HASHTAB_EXTENT) == 0);
m_size = new_size;
m_fd = new_fd;
}
BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename) :
BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t size) :
m_ctx(ctx),
m_size(0),
m_void_ptr(nullptr),
@@ -587,66 +740,69 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename) :
m_buckets(0),
m_cells(0),
m_writeback_thread("hash_writeback"),
m_prefetch_thread("hash_prefetch " + m_ctx->root_path()),
m_prefetch_thread("hash_prefetch"),
m_flush_rate_limit(BEES_FLUSH_RATE),
m_prefetch_rate_limit(BEES_FLUSH_RATE),
m_stats_file(m_ctx->home_fd(), "beesstats.txt")
{
BEESNOTE("opening hash table " << filename);
m_fd = openat_or_die(m_ctx->home_fd(), filename, FLAGS_OPEN_FILE_RW, 0700);
Stat st(m_fd);
m_size = st.st_size;
BEESTRACE("hash table size " << m_size);
BEESTRACE("hash table bucket size " << BLOCK_SIZE_HASHTAB_BUCKET);
BEESTRACE("hash table extent size " << BLOCK_SIZE_HASHTAB_EXTENT);
// Sanity checks to protect the implementation from its weaknesses
THROW_CHECK2(invalid_argument, BLOCK_SIZE_HASHTAB_BUCKET, BLOCK_SIZE_HASHTAB_EXTENT, (BLOCK_SIZE_HASHTAB_EXTENT % BLOCK_SIZE_HASHTAB_BUCKET) == 0);
// Does the union work?
THROW_CHECK2(runtime_error, m_void_ptr, m_cell_ptr, m_void_ptr == m_cell_ptr);
THROW_CHECK2(runtime_error, m_void_ptr, m_byte_ptr, m_void_ptr == m_byte_ptr);
THROW_CHECK2(runtime_error, m_void_ptr, m_bucket_ptr, m_void_ptr == m_bucket_ptr);
THROW_CHECK2(runtime_error, m_void_ptr, m_extent_ptr, m_void_ptr == m_extent_ptr);
// There's more than one union
THROW_CHECK2(runtime_error, sizeof(Bucket), BLOCK_SIZE_HASHTAB_BUCKET, BLOCK_SIZE_HASHTAB_BUCKET == sizeof(Bucket));
THROW_CHECK2(runtime_error, sizeof(Bucket::p_byte), BLOCK_SIZE_HASHTAB_BUCKET, BLOCK_SIZE_HASHTAB_BUCKET == sizeof(Bucket::p_byte));
THROW_CHECK2(runtime_error, sizeof(Extent), BLOCK_SIZE_HASHTAB_EXTENT, BLOCK_SIZE_HASHTAB_EXTENT == sizeof(Extent));
THROW_CHECK2(runtime_error, sizeof(Extent::p_byte), BLOCK_SIZE_HASHTAB_EXTENT, BLOCK_SIZE_HASHTAB_EXTENT == sizeof(Extent::p_byte));
BEESLOG("opened hash table filename '" << filename << "' length " << m_size);
m_filename = filename;
m_size = size;
open_file();
// Now we know size we can compute stuff
BEESTRACE("hash table size " << m_size);
BEESTRACE("hash table bucket size " << BLOCK_SIZE_HASHTAB_BUCKET);
BEESTRACE("hash table extent size " << BLOCK_SIZE_HASHTAB_EXTENT);
BEESLOGINFO("opened hash table filename '" << filename << "' length " << m_size);
m_buckets = m_size / BLOCK_SIZE_HASHTAB_BUCKET;
m_cells = m_buckets * c_cells_per_bucket;
m_extents = (m_size + BLOCK_SIZE_HASHTAB_EXTENT - 1) / BLOCK_SIZE_HASHTAB_EXTENT;
BEESLOG("\tcells " << m_cells << ", buckets " << m_buckets << ", extents " << m_extents);
BEESLOGINFO("\tcells " << m_cells << ", buckets " << m_buckets << ", extents " << m_extents);
BEESLOG("\tflush rate limit " << BEES_FLUSH_RATE);
BEESLOGINFO("\tflush rate limit " << BEES_FLUSH_RATE);
if (using_shared_map()) {
try_mmap_flags(MAP_SHARED);
} else {
try_mmap_flags(MAP_PRIVATE | MAP_ANONYMOUS);
}
// Try to mmap that much memory
try_mmap_flags(MAP_PRIVATE | MAP_ANONYMOUS);
if (!m_cell_ptr) {
THROW_ERROR(runtime_error, "unable to mmap " << filename);
THROW_ERRNO("unable to mmap " << filename);
}
if (!using_shared_map()) {
// madvise fails if MAP_SHARED
if (using_any_madvise()) {
// DONTFORK because we sometimes do fork,
// but the child doesn't touch any of the many, many pages
BEESTOOLONG("madvise(MADV_HUGEPAGE | MADV_DONTFORK)");
DIE_IF_NON_ZERO(madvise(m_byte_ptr, m_size, MADV_HUGEPAGE | MADV_DONTFORK));
}
for (uint64_t i = 0; i < m_size / sizeof(Extent); ++i) {
m_buckets_missing.insert(i);
// Do unions work the way we think (and rely on)?
THROW_CHECK2(runtime_error, m_void_ptr, m_cell_ptr, m_void_ptr == m_cell_ptr);
THROW_CHECK2(runtime_error, m_void_ptr, m_byte_ptr, m_void_ptr == m_byte_ptr);
THROW_CHECK2(runtime_error, m_void_ptr, m_bucket_ptr, m_void_ptr == m_bucket_ptr);
THROW_CHECK2(runtime_error, m_void_ptr, m_extent_ptr, m_void_ptr == m_extent_ptr);
// Give all the madvise hints that the kernel understands
const struct madv_flag {
const char *name;
int value;
} madv_flags[] = {
{ .name = "MADV_HUGEPAGE", .value = MADV_HUGEPAGE },
{ .name = "MADV_DONTFORK", .value = MADV_DONTFORK },
{ .name = "MADV_DONTDUMP", .value = MADV_DONTDUMP },
{ .name = "", .value = 0 },
};
for (auto fp = madv_flags; fp->value; ++fp) {
BEESTOOLONG("madvise(" << fp->name << ")");
if (madvise(m_byte_ptr, m_size, fp->value)) {
BEESLOGNOTICE("madvise(..., " << fp->name << "): " << strerror(errno) << " (ignored)");
}
}
m_extent_metadata.resize(m_extents);
m_writeback_thread.exec([&]() {
writeback_loop();
});
@@ -655,28 +811,69 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename) :
prefetch_loop();
});
// Blacklist might fail if the hash table is not stored on a btrfs
// Blacklist might fail if the hash table is not stored on a btrfs,
// or if it's on a _different_ btrfs
catch_all([&]() {
m_ctx->blacklist_add(BeesFileId(m_fd));
// Root is definitely a btrfs
BtrfsIoctlFsInfoArgs root_info;
root_info.do_ioctl(m_ctx->root_fd());
// Hash might not be a btrfs
BtrfsIoctlFsInfoArgs hash_info;
// If btrfs fs_info ioctl fails, it must be a different fs
if (!hash_info.do_ioctl_nothrow(m_fd)) return;
// If Hash is a btrfs, Root must be the same one
if (root_info.fsid() != hash_info.fsid()) return;
// Hash is on the same one, blacklist it
m_ctx->blacklist_insert(BeesFileId(m_fd));
});
// Skip zero because we already weed that out before it gets near a hash function
for (unsigned i = 1; i < 256; ++i) {
vector<uint8_t> v(BLOCK_SIZE_SUMS, i);
HashType hash = Digest::CRC::crc64(v.data(), v.size());
m_toxic_hashes.insert(hash);
}
}
BeesHashTable::~BeesHashTable()
{
BEESLOGDEBUG("Destroy BeesHashTable");
if (m_cell_ptr && m_size) {
flush_dirty_extents();
// Dirty extents should have been flushed before now,
// e.g. in stop(). If that didn't happen, don't fall
// into the same trap (and maybe throw an exception) here.
// flush_dirty_extents(false);
catch_all([&]() {
// drop the memory mapping
BEESTOOLONG("unmap handle table size " << pretty(m_size));
DIE_IF_NON_ZERO(munmap(m_cell_ptr, m_size));
m_cell_ptr = nullptr;
m_size = 0;
});
m_cell_ptr = nullptr;
m_size = 0;
}
BEESLOGDEBUG("BeesHashTable destroyed");
}
void
BeesHashTable::stop_request()
{
BEESNOTE("stopping BeesHashTable threads");
BEESLOGDEBUG("Stopping BeesHashTable threads");
unique_lock<mutex> lock(m_stop_mutex);
m_stop_requested = true;
m_stop_condvar.notify_all();
lock.unlock();
// Wake up hash writeback too
unique_lock<mutex> dirty_lock(m_dirty_mutex);
m_dirty_condvar.notify_all();
dirty_lock.unlock();
}
void
BeesHashTable::stop_wait()
{
BEESNOTE("waiting for hash_prefetch thread");
BEESLOGDEBUG("Waiting for hash_prefetch thread");
m_prefetch_thread.join();
BEESNOTE("waiting for hash_writeback thread");
BEESLOGDEBUG("Waiting for hash_writeback thread");
m_writeback_thread.join();
BEESLOGDEBUG("BeesHashTable stopped");
}
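
The constructor in the diff above replaces a single `madvise(..., MADV_HUGEPAGE | MADV_DONTFORK)` call with a table of hints applied one at a time, logging failures instead of aborting. `madvise` takes one advice value per call, and the values are not bit flags, so the one-call-per-hint form is the one that applies each hint as intended. The following is a minimal sketch of that pattern, assuming Linux and an anonymous private mapping; `apply_hints` and `demo_hints` are hypothetical names and are not part of bees.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <sys/mman.h>

// Hypothetical helper, not part of bees: apply each hint independently and
// report failures instead of aborting, since a kernel may reject some hints.
// Returns the number of hints the kernel accepted.
int apply_hints(void *ptr, size_t size)
{
    struct madv_flag {
        const char *name;
        int value;
    };
    // A zero-valued sentinel terminates the table, as in the diff.
    static const madv_flag madv_flags[] = {
        { "MADV_HUGEPAGE", MADV_HUGEPAGE },
        { "MADV_DONTFORK", MADV_DONTFORK },
        { "MADV_DONTDUMP", MADV_DONTDUMP },
        { "", 0 },
    };
    int accepted = 0;
    for (const madv_flag *fp = madv_flags; fp->value; ++fp) {
        if (madvise(ptr, size, fp->value)) {
            std::cerr << "madvise(..., " << fp->name << "): "
                      << std::strerror(errno) << " (ignored)\n";
        } else {
            ++accepted;
        }
    }
    return accepted;
}

// Map an anonymous region, apply the hints, and unmap it again.
// Returns -1 if mmap fails, otherwise the number of accepted hints (0..3).
int demo_hints(size_t size)
{
    void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) return -1;
    const int accepted = apply_hints(ptr, size);
    munmap(ptr, size);
    return accepted;
}
```

How many hints succeed depends on the kernel configuration (for example, whether transparent huge pages are enabled), which is why the sketch only counts successes rather than requiring all of them.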


@@ -98,90 +98,79 @@ BeesResolver::adjust_offset(const BeesFileRange &haystack, const BeesBlockData &
return BeesBlockData();
}
off_t lower_offset = haystack.begin();
off_t upper_offset = haystack.end();
off_t haystack_offset = haystack.begin();
bool is_compressed_offset = false;
bool is_exact = false;
bool is_legacy = false;
if (m_addr.is_compressed()) {
BtrfsExtentWalker ew(haystack.fd(), haystack.begin(), m_ctx->root_fd());
BEESTRACE("haystack extent data " << ew);
BEESTRACE("haystack extent data " << ew);
Extent e = ew.current();
if (m_addr.has_compressed_offset()) {
off_t coff = m_addr.get_compressed_offset();
if (e.offset() > coff) {
// this extent begins after the target block
BEESCOUNT(adjust_offset_low);
return BeesBlockData();
}
coff -= e.offset();
if (e.size() <= coff) {
// this extent ends before the target block
BEESCOUNT(adjust_offset_high);
return BeesBlockData();
}
lower_offset = e.begin() + coff;
upper_offset = lower_offset + BLOCK_SIZE_CLONE;
BEESCOUNT(adjust_offset_hit);
is_compressed_offset = true;
} else {
lower_offset = e.begin();
upper_offset = e.end();
BEESCOUNT(adjust_legacy);
is_legacy = true;
THROW_CHECK1(runtime_error, m_addr, m_addr.has_compressed_offset());
off_t coff = m_addr.get_compressed_offset();
if (e.offset() > coff) {
// this extent begins after the target block
BEESCOUNT(adjust_offset_low);
return BeesBlockData();
}
coff -= e.offset();
if (e.size() <= coff) {
// this extent ends before the target block
BEESCOUNT(adjust_offset_high);
return BeesBlockData();
}
haystack_offset = e.begin() + coff;
BEESCOUNT(adjust_offset_hit);
is_compressed_offset = true;
} else {
BEESCOUNT(adjust_exact);
is_exact = true;
}
BEESTRACE("Checking haystack " << haystack << " offsets " << to_hex(lower_offset) << ".." << to_hex(upper_offset));
BEESTRACE("Checking haystack " << haystack << " offset " << to_hex(haystack_offset));
// Check all the blocks in the list
for (off_t haystack_offset = lower_offset; haystack_offset < upper_offset; haystack_offset += BLOCK_SIZE_CLONE) {
THROW_CHECK1(out_of_range, haystack_offset, (haystack_offset & BLOCK_MASK_CLONE) == 0);
THROW_CHECK1(out_of_range, haystack_offset, (haystack_offset & BLOCK_MASK_CLONE) == 0);
// Straw cannot extend beyond end of haystack
if (haystack_offset + needle.size() > haystack_size) {
BEESCOUNT(adjust_needle_too_long);
break;
}
// Read the haystack
BEESTRACE("straw " << name_fd(haystack.fd()) << ", offset " << to_hex(haystack_offset) << ", length " << needle.size());
BeesBlockData straw(haystack.fd(), haystack_offset, needle.size());
BEESTRACE("straw = " << straw);
// Stop if we find a match
if (straw.is_data_equal(needle)) {
BEESCOUNT(adjust_hit);
m_found_data = true;
m_found_hash = true;
if (is_compressed_offset) BEESCOUNT(adjust_compressed_offset_correct);
if (is_legacy) BEESCOUNT(adjust_legacy_correct);
if (is_exact) BEESCOUNT(adjust_exact_correct);
return straw;
}
if (straw.hash() != needle.hash()) {
// Not the same hash or data, try next block
BEESCOUNT(adjust_miss);
continue;
}
// Found the hash but not the data. Yay!
m_found_hash = true;
BEESLOG("HASH COLLISION\n"
<< "\tneedle " << needle << "\n"
<< "\tstraw " << straw);
BEESCOUNT(hash_collision);
// Straw cannot extend beyond end of haystack
if (haystack_offset + needle.size() > haystack_size) {
BEESCOUNT(adjust_needle_too_long);
return BeesBlockData();
}
// Read the haystack
BEESTRACE("straw " << name_fd(haystack.fd()) << ", offset " << to_hex(haystack_offset) << ", length " << needle.size());
BeesBlockData straw(haystack.fd(), haystack_offset, needle.size());
BEESTRACE("straw = " << straw);
// Stop if we find a match
if (straw.is_data_equal(needle)) {
BEESCOUNT(adjust_hit);
m_found_data = true;
m_found_hash = true;
if (is_compressed_offset) BEESCOUNT(adjust_compressed_offset_correct);
if (is_exact) BEESCOUNT(adjust_exact_correct);
return straw;
}
if (straw.hash() != needle.hash()) {
// Not the same hash or data, try next block
BEESCOUNT(adjust_miss);
return BeesBlockData();
}
// Found the hash but not the data. Yay!
m_found_hash = true;
#if 0
BEESLOGINFO("HASH COLLISION\n"
<< "\tneedle " << needle << "\n"
<< "\tstraw " << straw);
#endif
BEESCOUNT(hash_collision);
// Ran out of offsets to try
BEESCOUNT(adjust_no_match);
if (is_compressed_offset) BEESCOUNT(adjust_compressed_offset_wrong);
if (is_legacy) BEESCOUNT(adjust_legacy_wrong);
if (is_exact) BEESCOUNT(adjust_exact_wrong);
m_wrong_data = true;
return BeesBlockData();
@@ -196,8 +185,8 @@ BeesResolver::chase_extent_ref(const BtrfsInodeOffsetRoot &bior, BeesBlockData &
Fd file_fd = m_ctx->roots()->open_root_ino(bior.m_root, bior.m_inum);
if (!file_fd) {
// Delete snapshots generate craptons of these
// BEESINFO("No FD in chase_extent_ref " << bior);
// Deleted snapshots generate craptons of these
// BEESLOGDEBUG("No FD in chase_extent_ref " << bior);
BEESCOUNT(chase_no_fd);
return BeesFileRange();
}
@@ -211,7 +200,7 @@ BeesResolver::chase_extent_ref(const BtrfsInodeOffsetRoot &bior, BeesBlockData &
// ...or are we?
if (file_addr.is_magic()) {
BEESINFO("file_addr is magic: file_addr = " << file_addr << " bior = " << bior << " needle_bbd = " << needle_bbd);
BEESLOGDEBUG("file_addr is magic: file_addr = " << file_addr << " bior = " << bior << " needle_bbd = " << needle_bbd);
BEESCOUNT(chase_wrong_magic);
return BeesFileRange();
}
@@ -220,7 +209,7 @@ BeesResolver::chase_extent_ref(const BtrfsInodeOffsetRoot &bior, BeesBlockData &
// Did we get the physical block we asked for? The magic bits have to match too,
// but the compressed offset bits do not.
if (file_addr.get_physical_or_zero() != m_addr.get_physical_or_zero()) {
// BEESINFO("found addr " << file_addr << " at " << name_fd(file_fd) << " offset " << to_hex(bior.m_offset) << " but looking for " << m_addr);
// BEESLOGDEBUG("found addr " << file_addr << " at " << name_fd(file_fd) << " offset " << to_hex(bior.m_offset) << " but looking for " << m_addr);
// FIEMAP/resolve are working, but the data is old.
BEESCOUNT(chase_wrong_addr);
return BeesFileRange();
@@ -240,10 +229,12 @@ BeesResolver::chase_extent_ref(const BtrfsInodeOffsetRoot &bior, BeesBlockData &
// Search near the resolved address for a matching data block.
// ...even if it's not compressed, we should do this sanity
// check before considering the block as a duplicate candidate.
// FIXME: this is mostly obsolete now and we shouldn't do it here.
// Don't bother fixing it because it will all go away with (extent, offset) reads.
auto new_bbd = adjust_offset(haystack_bbd, needle_bbd);
if (new_bbd.empty()) {
// matching offset search failed
BEESCOUNT(chase_wrong_data);
BEESCOUNT(chase_no_data);
return BeesFileRange();
}
if (new_bbd.begin() == haystack_bbd.begin()) {
@@ -368,7 +359,8 @@ BeesResolver::for_each_extent_ref(BeesBlockData bbd, function<bool(const BeesFil
}
// Look at the old data
catch_all([&]() {
// FIXME: propagate exceptions for now. Proper fix requires a rewrite.
// catch_all([&]() {
BEESTRACE("chase_extent_ref ino " << ino_off_root << " bbd " << bbd);
auto new_range = chase_extent_ref(ino_off_root, bbd);
// XXX: should we catch visitor's exceptions here?
@@ -378,9 +370,12 @@ BeesResolver::for_each_extent_ref(BeesBlockData bbd, function<bool(const BeesFil
// We have reliable block addresses now, so we guarantee we can hit the desired block.
// Failure in chase_extent_ref means we are done, and don't need to look up all the
// other references.
stop_now = true;
// Or...not? If we have a compressed extent, some refs will not match
// if there are two references to the same extent with a reference
// to a different extent between them.
// stop_now = true;
}
});
// });
if (stop_now) {
break;
@@ -389,26 +384,29 @@ BeesResolver::for_each_extent_ref(BeesBlockData bbd, function<bool(const BeesFil
return stop_now;
}
BeesFileRange
BeesResolver::replace_dst(const BeesFileRange &dst_bfr)
BeesRangePair
BeesResolver::replace_dst(const BeesFileRange &dst_bfr_in)
{
BEESTRACE("replace_dst dst_bfr " << dst_bfr);
BEESTRACE("replace_dst dst_bfr " << dst_bfr_in);
BEESCOUNT(replacedst_try);
// Open dst, reuse it for all src
BEESNOTE("Opening dst bfr " << dst_bfr);
BEESTRACE("Opening dst bfr " << dst_bfr);
BEESNOTE("Opening dst bfr " << dst_bfr_in);
BEESTRACE("Opening dst bfr " << dst_bfr_in);
auto dst_bfr = dst_bfr_in;
dst_bfr.fd(m_ctx);
BeesFileRange overlap_bfr;
BEESTRACE("overlap_bfr " << overlap_bfr);
BeesBlockData bbd(dst_bfr);
BeesRangePair rv = { BeesFileRange(), BeesFileRange() };
for_each_extent_ref(bbd, [&](const BeesFileRange &src_bfr) -> bool {
for_each_extent_ref(bbd, [&](const BeesFileRange &src_bfr_in) -> bool {
// Open src
BEESNOTE("Opening src bfr " << src_bfr);
BEESTRACE("Opening src bfr " << src_bfr);
BEESNOTE("Opening src bfr " << src_bfr_in);
BEESTRACE("Opening src bfr " << src_bfr_in);
auto src_bfr = src_bfr_in;
src_bfr.fd(m_ctx);
if (dst_bfr.overlaps(src_bfr)) {
@@ -421,7 +419,9 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr)
BeesBlockData src_bbd(src_bfr.fd(), src_bfr.begin(), min(BLOCK_SIZE_SUMS, src_bfr.size()));
if (bbd.addr().get_physical_or_zero() == src_bbd.addr().get_physical_or_zero()) {
BEESCOUNT(replacedst_same);
return false; // i.e. continue
// stop looping here, all the other srcs will probably fail this test too
BeesTracer::set_silent();
throw runtime_error("FIXME: too many duplicate candidates, bailing out here");
}
// Make pair(src, dst)
@@ -437,21 +437,12 @@ BeesResolver::replace_dst(const BeesFileRange &dst_bfr)
BEESCOUNT(replacedst_grown);
}
// Dedup
BEESNOTE("dedup " << brp);
if (m_ctx->dedup(brp)) {
BEESCOUNT(replacedst_dedup_hit);
m_found_dup = true;
overlap_bfr = brp.second;
// FIXME: find best range first, then dedup that
return true; // i.e. break
} else {
BEESCOUNT(replacedst_dedup_miss);
return false; // i.e. continue
}
rv = brp;
m_found_dup = true;
return true;
});
// BEESLOG("overlap_bfr after " << overlap_bfr);
return overlap_bfr.copy_closed();
return rv;
}
BeesFileRange
@@ -477,11 +468,6 @@ BeesResolver::find_all_matches(BeesBlockData &bbd)
bool
BeesResolver::operator<(const BeesResolver &that) const
{
if (that.m_bior_count < m_bior_count) {
return true;
} else if (m_bior_count < that.m_bior_count) {
return false;
}
return m_addr < that.m_addr;
// Lowest count, highest address
return tie(that.m_bior_count, m_addr) < tie(m_bior_count, that.m_addr);
}
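
The `operator<` change above swaps an explicit if/else chain for a single `std::tie` comparison. Putting `that.m_bior_count` on the left and `m_bior_count` on the right reverses the direction of that key only, while the address key keeps its natural direction. The sketch below illustrates this with a hypothetical `Candidate` struct (a stand-in holding only the two compared fields, not the real `BeesResolver`) and checks that the tuple form agrees with the branch form.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for BeesResolver with only the two compared fields.
struct Candidate {
    uint64_t bior_count;
    uint64_t addr;
    bool operator<(const Candidate &that) const
    {
        // Swapping the count operands reverses that key's direction:
        // a higher count compares as "less"; ties break on ascending address.
        return std::tie(that.bior_count, addr) < std::tie(bior_count, that.addr);
    }
};

// The pre-change form of the same comparison, written as explicit branches.
inline bool old_less(const Candidate &a, const Candidate &b)
{
    if (b.bior_count < a.bior_count) {
        return true;
    } else if (a.bior_count < b.bior_count) {
        return false;
    }
    return a.addr < b.addr;
}
```

Under this ordering the greatest element is the one with the lowest count and, among equal counts, the highest address, which is consistent with the "Lowest count, highest address" comment if the caller picks the maximum. How bees consumes the ordering is not shown in this diff.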

File diff suppressed because it is too large


@@ -13,19 +13,16 @@ void
BeesThread::exec(function<void()> func)
{
m_timer.reset();
BEESLOG("BeesThread exec " << m_name);
BEESLOGDEBUG("BeesThread exec " << m_name);
m_thread_ptr = make_shared<thread>([=]() {
BEESLOG("Starting thread " << m_name);
BeesNote::set_name(m_name);
BEESLOGDEBUG("Starting thread " << m_name);
BEESNOTE("thread function");
Timer thread_time;
catch_all([&]() {
DIE_IF_MINUS_ERRNO(pthread_setname_np(pthread_self(), m_name.c_str()));
});
catch_all([&]() {
func();
});
BEESLOG("Exiting thread " << m_name << ", " << thread_time << " sec");
BEESLOGDEBUG("Exiting thread " << m_name << ", " << thread_time << " sec");
});
}
@@ -33,7 +30,7 @@ BeesThread::BeesThread(string name, function<void()> func) :
m_name(name)
{
THROW_CHECK1(invalid_argument, name, !name.empty());
BEESLOG("BeesThread construct " << m_name);
BEESLOGDEBUG("BeesThread construct " << m_name);
exec(func);
}
@@ -41,20 +38,20 @@ void
BeesThread::join()
{
if (!m_thread_ptr) {
BEESLOG("Thread " << m_name << " no thread ptr");
BEESLOGDEBUG("Thread " << m_name << " no thread ptr");
return;
}
BEESLOG("BeesThread::join " << m_name);
BEESLOGDEBUG("BeesThread::join " << m_name);
if (m_thread_ptr->joinable()) {
BEESLOG("Joining thread " << m_name);
BEESLOGDEBUG("Joining thread " << m_name);
Timer thread_time;
m_thread_ptr->join();
BEESLOG("Waited for " << m_name << ", " << thread_time << " sec");
BEESLOGDEBUG("Waited for " << m_name << ", " << thread_time << " sec");
} else if (!m_name.empty()) {
BEESLOG("BeesThread " << m_name << " not joinable");
BEESLOGDEBUG("BeesThread " << m_name << " not joinable");
} else {
BEESLOG("BeesThread else " << m_name);
BEESLOGDEBUG("BeesThread else " << m_name);
}
}
@@ -67,25 +64,20 @@ BeesThread::set_name(const string &name)
BeesThread::~BeesThread()
{
if (!m_thread_ptr) {
BEESLOG("Thread " << m_name << " no thread ptr");
BEESLOGDEBUG("Thread " << m_name << " no thread ptr");
return;
}
BEESLOG("BeesThread destructor " << m_name);
BEESLOGDEBUG("BeesThread destructor " << m_name);
if (m_thread_ptr->joinable()) {
BEESLOG("Cancelling thread " << m_name);
int rv = pthread_cancel(m_thread_ptr->native_handle());
if (rv) {
BEESLOG("pthread_cancel returned " << strerror(-rv));
}
BEESLOG("Waiting for thread " << m_name);
BEESLOGDEBUG("Waiting for thread " << m_name);
Timer thread_time;
m_thread_ptr->join();
BEESLOG("Waited for " << m_name << ", " << thread_time << " sec");
BEESLOGDEBUG("Waited for " << m_name << ", " << thread_time << " sec");
} else if (!m_name.empty()) {
BEESLOG("Thread " << m_name << " not joinable");
BEESLOGDEBUG("Thread " << m_name << " not joinable");
} else {
BEESLOG("Thread destroy else " << m_name);
BEESLOGDEBUG("Thread destroy else " << m_name);
}
}

src/bees-trace.cc Normal file

@@ -0,0 +1,155 @@
#include "bees.h"
// tracing ----------------------------------------
int bees_log_level = 8;
thread_local BeesTracer *BeesTracer::tl_next_tracer = nullptr;
thread_local bool BeesTracer::tl_first = true;
thread_local bool BeesTracer::tl_silent = false;
bool
exception_check()
{
#if __cplusplus >= 201703
return uncaught_exceptions();
#else
return uncaught_exception();
#endif
}
BeesTracer::~BeesTracer()
{
if (!tl_silent && exception_check()) {
if (tl_first) {
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE --- exception ---");
tl_first = false;
}
try {
m_func();
} catch (exception &e) {
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception: " << e.what());
} catch (...) {
BEESLOG(BEES_TRACE_LEVEL, "TRACE: Nested exception ...");
}
if (!m_next_tracer) {
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE --- exception ---");
}
}
tl_next_tracer = m_next_tracer;
if (!m_next_tracer) {
tl_silent = false;
tl_first = true;
}
}
BeesTracer::BeesTracer(const function<void()> &f, bool silent) :
m_func(f)
{
m_next_tracer = tl_next_tracer;
tl_next_tracer = this;
tl_silent = silent;
}
void
BeesTracer::trace_now()
{
BeesTracer *tp = tl_next_tracer;
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- BEGIN TRACE ---");
while (tp) {
tp->m_func();
tp = tp->m_next_tracer;
}
BEESLOG(BEES_TRACE_LEVEL, "TRACE: --- END TRACE ---");
}
bool
BeesTracer::get_silent()
{
return tl_silent;
}
void
BeesTracer::set_silent()
{
tl_silent = true;
}
thread_local BeesNote *BeesNote::tl_next = nullptr;
mutex BeesNote::s_mutex;
map<pid_t, BeesNote*> BeesNote::s_status;
thread_local string BeesNote::tl_name;
BeesNote::~BeesNote()
{
tl_next = m_prev;
unique_lock<mutex> lock(s_mutex);
if (tl_next) {
s_status[gettid()] = tl_next;
} else {
s_status.erase(gettid());
}
}
BeesNote::BeesNote(function<void(ostream &os)> f) :
m_func(f)
{
m_name = get_name();
m_prev = tl_next;
tl_next = this;
unique_lock<mutex> lock(s_mutex);
s_status[gettid()] = tl_next;
}
void
BeesNote::set_name(const string &name)
{
tl_name = name;
pthread_setname(name);
}
string
BeesNote::get_name()
{
// Use explicit name if given
if (!tl_name.empty()) {
return tl_name;
}
// Try a Task name. If there is one, return it, but do not
// remember it. Each output message may be a different Task.
// The current task is thread_local so we don't need to worry
// about it being destroyed under us.
auto current_task = Task::current_task();
if (current_task) {
return current_task.title();
}
// OK try the pthread name next.
// thread_getname_np returns process name
// ...by default? ...for the main thread?
// ...except during exception handling?
// ...randomly?
return pthread_getname();
}
BeesNote::ThreadStatusMap
BeesNote::get_status()
{
unique_lock<mutex> lock(s_mutex);
ThreadStatusMap rv;
for (auto t : s_status) {
ostringstream oss;
if (!t.second->m_name.empty()) {
oss << t.second->m_name << ": ";
}
if (t.second->m_timer.age() > BEES_TOO_LONG) {
oss << "[" << t.second->m_timer << "s] ";
}
t.second->m_func(oss);
rv[t.first] = oss.str();
}
return rv;
}
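
`BeesTracer` above relies on a destructor that checks for an in-flight exception and only then runs its stored callback, so trace lines appear during stack unwinding and cost almost nothing on the normal path. The sketch below is a simplified illustration of that idea, not the bees implementation: `ScopedTrace`, `tl_trace_lines`, and `run_traced` are hypothetical names, the log sink is a thread-local vector instead of `BEESLOG`, and the linked list of tracers, the BEGIN/END markers, and the silent flag are all omitted.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sink standing in for BEESLOG: collects trace lines per thread.
thread_local std::vector<std::string> tl_trace_lines;

// Same version guard as exception_check() in the diff.
inline bool exception_in_flight()
{
#if __cplusplus >= 201703L
    return std::uncaught_exceptions() > 0;
#else
    return std::uncaught_exception();
#endif
}

// Simplified tracer: runs its callback only when destroyed during unwinding.
class ScopedTrace {
    std::function<std::string()> m_func;
public:
    explicit ScopedTrace(std::function<std::string()> f) : m_func(std::move(f)) {}
    ~ScopedTrace()
    {
        if (!exception_in_flight()) return;
        try {
            tl_trace_lines.push_back(m_func());
        } catch (...) {
            // Never let a tracing failure escape a destructor.
        }
    }
};

// Throws from the innermost scope when asked to; each enclosing tracer
// then adds one line, innermost first, while the stack unwinds.
std::vector<std::string> run_traced(bool fail)
{
    tl_trace_lines.clear();
    try {
        ScopedTrace outer([] { return std::string("outer step"); });
        ScopedTrace inner([] { return std::string("inner step"); });
        if (fail) throw std::runtime_error("boom");
    } catch (const std::exception &) {
        // Swallowed here; the trace lines were recorded during unwinding.
    }
    return tl_trace_lines;
}
```

Because the callback is evaluated lazily, the message-formatting cost is paid only when an exception is actually propagating.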


@@ -1,6 +1,5 @@
#include "bees.h"
#include "crucible/crc64.h"
#include "crucible/limits.h"
#include "crucible/ntoa.h"
#include "crucible/string.h"
@@ -71,7 +70,18 @@ operator<<(ostream &os, const BeesFileRange &bfr)
if (bfr.end() == numeric_limits<off_t>::max()) {
os << "- [" << to_hex(bfr.begin()) << "..eof]";
} else {
os << pretty(bfr.size()) << " [" << to_hex(bfr.begin()) << ".." << to_hex(bfr.end()) << "]";
os << pretty(bfr.size()) << " ";
if (bfr.begin() != 0) {
os << "[" << to_hex(bfr.begin());
} else {
os << "(";
}
os << ".." << to_hex(bfr.end());
if (!!bfr.m_fd && bfr.end() >= bfr.file_size()) {
os << ")";
} else {
os << "]";
}
}
if (bfr.m_fid) {
os << " fid = " << bfr.m_fid;
@@ -92,8 +102,6 @@ operator<<(ostream &os, const BeesRangePair &brp)
<< "\ndst = " << brp.second.fd() << " " << name_fd(brp.second.fd());
}
mutex BeesFileRange::s_mutex;
bool
BeesFileRange::operator<(const BeesFileRange &that) const
{
@@ -145,14 +153,14 @@ off_t
BeesFileRange::file_size() const
{
if (m_file_size <= 0) {
// Use method fd() not member m_fd() so we hold lock
Stat st(fd());
m_file_size = st.st_size;
// These checks could trigger on valid input, but that would mean we have
// lost a race (e.g. a file was truncated while we were building a
// matching range pair with it). In such cases we should probably stop
// whatever we were doing and backtrack to some higher level anyway.
THROW_CHECK1(invalid_argument, m_file_size, m_file_size > 0);
// Well, OK, but we call this function from exception handlers...
THROW_CHECK1(invalid_argument, m_file_size, m_file_size >= 0);
// THROW_CHECK2(invalid_argument, m_file_size, m_end, m_end <= m_file_size || m_end == numeric_limits<off_t>::max());
}
return m_file_size;
@@ -175,34 +183,42 @@ BeesFileRange::grow_begin(off_t delta)
return m_begin;
}
off_t
BeesFileRange::shrink_begin(off_t delta)
{
THROW_CHECK1(invalid_argument, delta, delta > 0);
THROW_CHECK3(invalid_argument, delta, m_begin, m_end, delta + m_begin < m_end);
m_begin += delta;
return m_begin;
}
off_t
BeesFileRange::shrink_end(off_t delta)
{
THROW_CHECK1(invalid_argument, delta, delta > 0);
THROW_CHECK2(invalid_argument, delta, m_end, m_end >= delta);
m_end -= delta;
return m_end;
}
BeesFileRange::BeesFileRange(const BeesBlockData &bbd) :
m_fd(bbd.fd()),
m_begin(bbd.begin()),
m_end(bbd.end()),
m_file_size(-1)
m_end(bbd.end())
{
}
BeesFileRange::BeesFileRange(Fd fd, off_t begin, off_t end) :
m_fd(fd),
m_begin(begin),
m_end(end),
m_file_size(-1)
m_end(end)
{
}
BeesFileRange::BeesFileRange(const BeesFileId &fid, off_t begin, off_t end) :
m_fid(fid),
m_begin(begin),
m_end(end),
m_file_size(-1)
{
}
BeesFileRange::BeesFileRange() :
m_begin(0),
m_end(0),
m_file_size(-1)
m_end(end)
{
}
@@ -240,42 +256,6 @@ BeesFileRange::overlaps(const BeesFileRange &that) const
return false;
}
bool
BeesFileRange::coalesce(const BeesFileRange &that)
{
// Let's define coalesce-with-null as identity,
// and coalesce-null-with-null as coalesced
if (!*this) {
operator=(that);
return true;
}
if (!that) {
return true;
}
// Can't coalesce different files
if (!is_same_file(that)) return false;
pair<uint64_t, uint64_t> a(m_begin, m_end);
pair<uint64_t, uint64_t> b(that.m_begin, that.m_end);
// range a starts lower than or equal b
if (b.first < a.first) {
swap(a, b);
}
// if b starts within a, they overlap
// (and the intersecting region is b.first..min(a.second, b.second))
// (and the union region is a.first..max(a.second, b.second))
if (b.first >= a.first && b.first < a.second) {
m_begin = a.first;
m_end = max(a.second, b.second);
return true;
}
return false;
}
BeesFileRange::operator BeesBlockData() const
{
BEESTRACE("operator BeesBlockData " << *this);
@@ -285,22 +265,18 @@ BeesFileRange::operator BeesBlockData() const
Fd
BeesFileRange::fd() const
{
unique_lock<mutex> lock(s_mutex);
return m_fd;
}
Fd
BeesFileRange::fd(const shared_ptr<BeesContext> &ctx) const
BeesFileRange::fd(const shared_ptr<BeesContext> &ctx)
{
unique_lock<mutex> lock(s_mutex);
// If we don't have a fid we can't do much here
if (m_fid) {
if (!m_fd) {
// If we don't have a fd, open by fid
if (m_fid && ctx) {
lock.unlock();
Fd new_fd = ctx->roots()->open_root_ino(m_fid);
lock.lock();
m_fd = new_fd;
}
} else {
@@ -374,6 +350,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESTOOLONG("grow constrained = " << constrained << " *this = " << *this);
BEESTRACE("grow constrained = " << constrained << " *this = " << *this);
bool rv = false;
Timer grow_backward_timer;
THROW_CHECK1(invalid_argument, first.begin(), (first.begin() & BLOCK_MASK_CLONE) == 0);
THROW_CHECK1(invalid_argument, second.begin(), (second.begin() & BLOCK_MASK_CLONE) == 0);
@@ -390,8 +367,8 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESTRACE("e_second " << e_second);
// Preread entire extent
posix_fadvise(second.fd(), e_second.begin(), e_second.size(), POSIX_FADV_WILLNEED);
posix_fadvise(first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size(), POSIX_FADV_WILLNEED);
bees_readahead_pair(second.fd(), e_second.begin(), e_second.size(),
first.fd(), e_second.begin() + first.begin() - second.begin(), e_second.size());
auto hash_table = ctx->hash_table();
@@ -410,7 +387,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESCOUNT(pairbackward_hole);
break;
}
posix_fadvise(second.fd(), e_second.begin(), e_second.size(), POSIX_FADV_WILLNEED);
bees_readahead(second.fd(), e_second.begin(), e_second.size());
#else
// This tends to repeatedly process extents that were recently processed.
// We tend to catch duplicate blocks early since we scan them forwards.
@@ -429,17 +406,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
break;
}
// Source extent cannot be toxic
BeesAddress first_addr(first.fd(), new_first.begin());
if (!first_addr.is_magic()) {
auto first_resolved = ctx->resolve_addr(first_addr);
if (first_resolved.is_toxic()) {
BEESLOG("WORKAROUND: not growing matching pair backward because src addr is toxic:\n" << *this);
BEESCOUNT(pairbackward_toxic_addr);
break;
}
}
// Extend second range. If we hit BOF we can go no further.
BeesFileRange new_second = second;
BEESTRACE("new_second = " << new_second);
@@ -475,6 +441,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
// Source block cannot be zero in a non-compressed non-magic extent
BeesAddress first_addr(first.fd(), new_first.begin());
if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
BEESCOUNT(pairbackward_zero);
break;
@@ -490,7 +457,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
}
if (found_toxic) {
BEESLOG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending backward:\n" << *this);
BEESCOUNT(pairbackward_toxic_hash);
break;
}
@@ -502,9 +469,11 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESCOUNT(pairbackward_hit);
}
BEESCOUNT(pairbackward_stop);
BEESCOUNTADD(pairbackward_ms, grow_backward_timer.age() * 1000);
// Look forward
BEESTRACE("grow_forward " << *this);
Timer grow_forward_timer;
while (first.size() < BLOCK_SIZE_MAX_EXTENT) {
if (second.end() >= e_second.end()) {
if (constrained) {
@@ -517,7 +486,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
BEESCOUNT(pairforward_hole);
break;
}
posix_fadvise(second.fd(), e_second.begin(), e_second.size(), POSIX_FADV_WILLNEED);
bees_readahead(second.fd(), e_second.begin(), e_second.size());
}
BEESCOUNT(pairforward_try);
@@ -530,17 +499,6 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
break;
}
// Source extent cannot be toxic
BeesAddress first_addr(first.fd(), new_first.begin());
if (!first_addr.is_magic()) {
auto first_resolved = ctx->resolve_addr(first_addr);
if (first_resolved.is_toxic()) {
BEESLOG("WORKAROUND: not growing matching pair forward because src is toxic:\n" << *this);
BEESCOUNT(pairforward_toxic);
break;
}
}
// Extend second range. If we hit EOF we can go no further.
BeesFileRange new_second = second;
BEESTRACE("new_second = " << new_second);
@@ -584,6 +542,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
// Source block cannot be zero in a non-compressed non-magic extent
BeesAddress first_addr(first.fd(), new_first.begin());
if (first_bbd.is_data_zero() && !first_addr.is_magic() && !first_addr.is_compressed()) {
BEESCOUNT(pairforward_zero);
break;
@@ -599,7 +558,7 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
}
if (found_toxic) {
BEESLOG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
BEESLOGDEBUG("WORKAROUND: found toxic hash in " << first_bbd << " while extending forward:\n" << *this);
BEESCOUNT(pairforward_toxic_hash);
break;
}
@@ -613,11 +572,12 @@ BeesRangePair::grow(shared_ptr<BeesContext> ctx, bool constrained)
}
if (first.overlaps(second)) {
BEESLOGTRACE("after grow, first " << first << "\n\toverlaps " << second);
BEESLOGDEBUG("after grow, first " << first << "\n\toverlaps " << second);
BEESCOUNT(bug_grow_pair_overlaps);
}
BEESCOUNT(pairforward_stop);
BEESCOUNTADD(pairforward_ms, grow_forward_timer.age() * 1000);
return rv;
}
@@ -627,6 +587,22 @@ BeesRangePair::copy_closed() const
return BeesRangePair(first.copy_closed(), second.copy_closed());
}
void
BeesRangePair::shrink_begin(off_t const delta)
{
first.shrink_begin(delta);
second.shrink_begin(delta);
THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
}
void
BeesRangePair::shrink_end(off_t const delta)
{
first.shrink_end(delta);
second.shrink_end(delta);
THROW_CHECK2(runtime_error, first.size(), second.size(), first.size() == second.size());
}
ostream &
operator<<(ostream &os, const BeesAddress &ba)
{
@@ -698,7 +674,7 @@ BeesAddress::magic_check(uint64_t flags)
static const unsigned recognized_flags = compressed_flags | delalloc_flags | ignore_flags | unusable_flags;
if (flags & ~recognized_flags) {
BEESLOGTRACE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
BEESLOGNOTICE("Unrecognized flags in " << fiemap_extent_flags_ntoa(flags));
m_addr = UNUSABLE;
// maybe we throw here?
BEESCOUNT(addr_unrecognized);
@@ -878,6 +854,9 @@ operator<<(ostream &os, const BeesBlockData &bbd)
os << ", hash = " << bbd.m_hash;
}
if (!bbd.m_data.empty()) {
// Turn this on to debug BeesBlockData, but leave it off otherwise.
// It's a massive data leak that is only interesting to developers.
#if 0
os << ", data[" << bbd.m_data.size() << "] = '";
size_t max_print = 12;
@@ -894,6 +873,9 @@ operator<<(ostream &os, const BeesBlockData &bbd)
}
}
os << "...'";
#else
os << ", data[" << bbd.m_data.size() << "]";
#endif
}
return os << " }";
}
@@ -936,12 +918,13 @@ BeesBlockData::data() const
{
if (m_data.empty()) {
THROW_CHECK1(invalid_argument, size(), size() > 0);
BEESNOTE("Reading BeesBlockData " << *this);
BEESTOOLONG("Reading BeesBlockData " << *this);
Timer read_timer;
Blob rv(m_length);
Blob rv(size());
pread_or_die(m_fd, rv, m_offset);
THROW_CHECK2(runtime_error, rv.size(), m_length, ranged_cast<off_t>(rv.size()) == m_length);
THROW_CHECK2(runtime_error, rv.size(), size(), ranged_cast<off_t>(rv.size()) == size());
m_data = rv;
BEESCOUNT(block_read);
BEESCOUNTADD(block_bytes, rv.size());
@@ -955,14 +938,10 @@ BeesHash
BeesBlockData::hash() const
{
if (!m_hash_done) {
// We can only dedup unaligned EOF blocks against other unaligned EOF blocks,
// We can only dedupe unaligned EOF blocks against other unaligned EOF blocks,
// so we do NOT round up to a full sum block size.
const Blob &blob = data();
// TODO: It turns out that file formats with 4K block
// alignment and embedded CRC64 do exist, and every block
// of such files has the same hash. Could use a subset
// of SHA1 here instead.
m_hash = Digest::CRC::crc64(blob.data(), blob.size());
m_hash = BeesHash(blob.data(), blob.size());
m_hash_done = true;
BEESCOUNT(block_hash);
}
@@ -974,9 +953,8 @@ bool
BeesBlockData::is_data_zero() const
{
// The CRC64 of zero is zero, so skip some work if we already know the CRC
if (m_hash_done && m_hash != 0) {
return false;
}
// ...but that doesn't work for any other hash function, and it
// saves us next to nothing.
// OK read block (maybe) and check every byte
for (auto c : data()) {
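
The `shrink_begin` and `shrink_end` helpers added to `BeesFileRange` and `BeesRangePair` earlier in this file's diff trim both ranges of a pair by the same amount and then verify that the two sizes still agree. The sketch below mirrors those checks with plain exceptions in place of the `THROW_CHECK` macros; `Range` and `RangePair` are hypothetical simplified types that hold only offsets (no fd or file id).

```cpp
#include <cassert>
#include <stdexcept>
#include <sys/types.h>

// Hypothetical simplified range: just [begin, end) offsets.
struct Range {
    off_t begin;
    off_t end;
    off_t size() const { return end - begin; }
    off_t shrink_begin(off_t delta)
    {
        if (delta <= 0) throw std::invalid_argument("delta must be positive");
        // Mirrors the diff's check: delta + m_begin < m_end
        if (delta + begin >= end) throw std::invalid_argument("range would become empty");
        begin += delta;
        return begin;
    }
    off_t shrink_end(off_t delta)
    {
        if (delta <= 0) throw std::invalid_argument("delta must be positive");
        // Mirrors the diff's check: m_end >= delta
        if (end < delta) throw std::invalid_argument("end would become negative");
        end -= delta;
        return end;
    }
};

// Pair of equal-length ranges; trimming both sides keeps them the same size.
struct RangePair {
    Range first;
    Range second;
    void shrink_begin(off_t delta)
    {
        first.shrink_begin(delta);
        second.shrink_begin(delta);
        if (first.size() != second.size()) throw std::runtime_error("sizes differ");
    }
    void shrink_end(off_t delta)
    {
        first.shrink_end(delta);
        second.shrink_end(delta);
        if (first.size() != second.size()) throw std::runtime_error("sizes differ");
    }
};
```

For example, a pair covering 16 KiB at offsets 0 and 65536 can be trimmed by one 4 KiB block on each side, leaving two 8 KiB ranges; trimming a further 8 KiB from the front is rejected because it would leave nothing.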

src/bees-usage.txt Normal file

@@ -0,0 +1,36 @@
Usage: %s [options] fs-root-path
Performs best-effort extent-same deduplication on btrfs.
fs-root-path MUST be the root of a btrfs filesystem tree (subvol id 5).
Other directories will be rejected.
Options:
-h, --help Show this help
Load management options:
-c, --thread-count Worker thread count (default CPU count * factor)
-C, --thread-factor Worker thread factor (default 1)
-G, --thread-min Minimum worker thread count (default 0)
-g, --loadavg-target Target load average for worker threads (default none)
--throttle-factor Idle time between operations (default 1.0)
Filesystem tree traversal options:
-m, --scan-mode Scanning mode (0..4, default 4)
Workarounds:
-a, --workaround-btrfs-send Workaround for btrfs send
(ignore RO snapshots)
Logging options:
-t, --timestamps Show timestamps in log output (default)
-T, --no-timestamps Omit timestamps in log output
-p, --absolute-paths Show absolute paths (default)
-P, --strip-paths Strip $CWD from beginning of all paths in the log
-v, --verbose Set maximum log level (0..8, default 8)
Optional environment variables:
BEESHOME Path to hash table and configuration files
(default is .beeshome/ in the root of the filesystem).
BEESSTATUS File to write status to (tmpfs recommended, e.g. /run).
No status is written if this variable is unset.
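
The usage text states that the worker thread count defaults to "CPU count * factor" and that the factor defaults to 1. The sketch below shows one way that default could be computed. The function names, the rounding up, and the floor of one thread are assumptions made for illustration; the usage text specifies only the multiplication.

```cpp
#include <cassert>
#include <cmath>
#include <thread>

// Hypothetical helper (not taken from bees): pick a worker thread count.
// explicit_count models -c/--thread-count (0 means "not given"),
// factor models -C/--thread-factor, cpu_count is the detected CPU count.
unsigned worker_thread_count(unsigned explicit_count, double factor, unsigned cpu_count)
{
    assert(factor > 0);
    if (explicit_count > 0) {
        return explicit_count;
    }
    // Assumption: round up, and never return fewer than one thread.
    const double scaled = std::ceil(cpu_count * factor);
    return scaled < 1.0 ? 1u : static_cast<unsigned>(scaled);
}

// Convenience wrapper using the CPU count reported by the standard library.
// hardware_concurrency() may return 0 when the count is unknown.
unsigned default_worker_thread_count(double factor = 1.0)
{
    return worker_thread_count(0, factor, std::thread::hardware_concurrency());
}
```

With these assumptions, a factor of 0.5 on a 3-CPU machine yields 2 threads, and an explicit count always overrides the factor.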

File diff suppressed because it is too large


@@ -1,7 +1,7 @@
#ifndef BEES_H
#define BEES_H
#include "crucible/bool.h"
#include "crucible/btrfs-tree.h"
#include "crucible/cache.h"
#include "crucible/chatter.h"
#include "crucible/error.h"
@@ -9,18 +9,21 @@
#include "crucible/fd.h"
#include "crucible/fs.h"
#include "crucible/lockset.h"
#include "crucible/multilock.h"
#include "crucible/pool.h"
#include "crucible/progress.h"
#include "crucible/time.h"
#include "crucible/timequeue.h"
#include "crucible/workqueue.h"
#include "crucible/task.h"
#include <array>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <random>
#include <thread>
#include <endian.h>
#include <syslog.h>
using namespace crucible;
using namespace std;
@@ -28,7 +31,7 @@ using namespace std;
// Block size for clone alignment (FIXME: should read this from /sys/fs/btrfs/<FS-UUID>/clone_alignment)
const off_t BLOCK_SIZE_CLONE = 4096;
// Block size for dedup checksums (arbitrary, but must be a multiple of clone alignment)
// Block size for dedupe checksums (arbitrary, but must be a multiple of clone alignment)
const off_t BLOCK_SIZE_SUMS = 4096;
// Block size for memory allocations and file mappings (FIXME: should be CPU page size)
@@ -40,13 +43,6 @@ const off_t BLOCK_SIZE_MAX_EXTENT_SAME = 4096 * 4096;
// Maximum length of a compressed extent in bytes
const off_t BLOCK_SIZE_MAX_COMPRESSED_EXTENT = 128 * 1024;
// Try to combine smaller extents into larger ones
const off_t BLOCK_SIZE_MIN_EXTENT_DEFRAG = BLOCK_SIZE_MAX_COMPRESSED_EXTENT;
// Avoid splitting extents that are already too small
const off_t BLOCK_SIZE_MIN_EXTENT_SPLIT = BLOCK_SIZE_MAX_COMPRESSED_EXTENT;
// const off_t BLOCK_SIZE_MIN_EXTENT_SPLIT = 1024LL * 1024 * 1024 * 1024;
// Maximum length of any extent in bytes
// except we've seen 1.03G extents...
// ...FIEMAP is slow and full of lies
@@ -55,59 +51,62 @@ const off_t BLOCK_SIZE_MAX_EXTENT = 128 * 1024 * 1024;
// Masks, so we don't have to write "(BLOCK_SIZE_CLONE - 1)" everywhere
const off_t BLOCK_MASK_CLONE = BLOCK_SIZE_CLONE - 1;
const off_t BLOCK_MASK_SUMS = BLOCK_SIZE_SUMS - 1;
const off_t BLOCK_MASK_MMAP = BLOCK_SIZE_MMAP - 1;
const off_t BLOCK_MASK_MAX_COMPRESSED_EXTENT = BLOCK_SIZE_MAX_COMPRESSED_EXTENT * 2 - 1;
// Maximum temporary file size
// Maximum temporary file size (maximum extent size for temporary copy)
const off_t BLOCK_SIZE_MAX_TEMP_FILE = 1024 * 1024 * 1024;
// Bucket size for hash table (size of one hash bucket)
const off_t BLOCK_SIZE_HASHTAB_BUCKET = BLOCK_SIZE_MMAP;
// Extent size for hash table (since the nocow file attribute does not seem to be working today)
const off_t BLOCK_SIZE_HASHTAB_EXTENT = 16 * 1024 * 1024;
const off_t BLOCK_SIZE_HASHTAB_EXTENT = BLOCK_SIZE_MAX_COMPRESSED_EXTENT;
// Bytes per second we want to flush (8GB every two hours)
const double BEES_FLUSH_RATE = 8.0 * 1024 * 1024 * 1024 / 7200.0;
// Bytes per second we want to flush from hash table
// Optimistic sustained write rate for SD cards
const double BEES_FLUSH_RATE = 128 * 1024;
// Interval between writing non-hash-table things to disk (15 minutes)
// Interval between writing crawl state to disk
const int BEES_WRITEBACK_INTERVAL = 900;
// Statistics reports while scanning
const int BEES_STATS_INTERVAL = 3600;
// Progress shows instantaneous rates and thread status
const int BEES_PROGRESS_INTERVAL = 3600;
const int BEES_PROGRESS_INTERVAL = BEES_STATS_INTERVAL;
// Status is output every freakin second. Use a ramdisk.
const int BEES_STATUS_INTERVAL = 1;
// Number of file FDs to cache when not in active use
const size_t BEES_FILE_FD_CACHE_SIZE = 524288;
// Number of root FDs to cache when not in active use
const size_t BEES_ROOT_FD_CACHE_SIZE = 65536;
// Number of FDs to open (rlimit)
const size_t BEES_OPEN_FILE_LIMIT = BEES_FILE_FD_CACHE_SIZE + BEES_ROOT_FD_CACHE_SIZE + 100;
// Worker thread factor (multiplied by detected number of CPU cores)
const double BEES_DEFAULT_THREAD_FACTOR = 1.0;
// Log warnings when an operation takes too long
const double BEES_TOO_LONG = 2.5;
const double BEES_TOO_LONG = 5.0;
// Avoid any extent where LOGICAL_INO takes this long
const double BEES_TOXIC_DURATION = 9.9;
// Avoid any extent where LOGICAL_INO takes this much kernel CPU time
const double BEES_TOXIC_SYS_DURATION = 5.0;
// How long we should wait for new btrfs transactions
const double BEES_COMMIT_INTERVAL = 900;
// Maximum number of refs to a single extent before we have other problems
// If we have more than 10K refs to an extent, adding another will save 0.01% space
const size_t BEES_MAX_EXTENT_REF_COUNT = 9999; // (16 * 1024 * 1024 / 24);
// How long between hash table histograms
const double BEES_HASH_TABLE_ANALYZE_INTERVAL = 3600;
const double BEES_HASH_TABLE_ANALYZE_INTERVAL = BEES_STATS_INTERVAL;
// Rate limiting of informational messages
const double BEES_INFO_RATE = 10.0;
const double BEES_INFO_BURST = 1.0;
// Wait at least this long for a new transid
const double BEES_TRANSID_POLL_INTERVAL = 30.0;
// After we have this many events queued, wait
const size_t BEES_MAX_QUEUE_SIZE = 1024;
// Read this many items at a time in SEARCHv2
const size_t BEES_MAX_CRAWL_SIZE = 4096;
// If an extent has this many refs, pretend it does not exist
// to avoid a crippling btrfs performance bug
// The actual limit in LOGICAL_INO seems to be 2730, but let's leave a little headroom
const size_t BEES_MAX_EXTENT_REF_COUNT = 2560;
// Workaround for silly dedupe / ineffective readahead behavior
const size_t BEES_READAHEAD_SIZE = 1024 * 1024;
// Flags
const int FLAGS_OPEN_COMMON = O_NOFOLLOW | O_NONBLOCK | O_CLOEXEC | O_NOATIME | O_LARGEFILE | O_NOCTTY;
@@ -122,19 +121,26 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
// macros ----------------------------------------
#define BEESLOG(x) do { Chatter c(BeesNote::get_name()); c << x; } while (0)
#define BEESLOGTRACE(x) do { BEESLOG(x); BeesTracer::trace_now(); } while (0)
#define BEESLOG(lv,x) do { if (lv < bees_log_level) { Chatter __chatter(lv, BeesNote::get_name()); __chatter << x; } } while (0)
#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(x); })
#define BEES_TRACE_LEVEL LOG_DEBUG
#define BEESTRACE(x) BeesTracer SRSLY_WTF_C(beesTracer_, __LINE__) ([&]() { BEESLOG(BEES_TRACE_LEVEL, "TRACE: " << x << " at " << __FILE__ << ":" << __LINE__); })
#define BEESTOOLONG(x) BeesTooLong SRSLY_WTF_C(beesTooLong_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
#define BEESNOTE(x) BeesNote SRSLY_WTF_C(beesNote_, __LINE__) ([&](ostream &_btl_os) { _btl_os << x; })
#define BEESINFO(x) do { \
if (bees_info_rate_limit.is_ready()) { \
bees_info_rate_limit.borrow(1); \
Chatter c(BeesNote::get_name()); \
c << x; \
} \
} while (0)
#define BEESLOGERR(x) BEESLOG(LOG_ERR, x)
#define BEESLOGWARN(x) BEESLOG(LOG_WARNING, x)
#define BEESLOGNOTICE(x) BEESLOG(LOG_NOTICE, x)
#define BEESLOGINFO(x) BEESLOG(LOG_INFO, x)
#define BEESLOGDEBUG(x) BEESLOG(LOG_DEBUG, x)
#define BEESLOGONCE(__x) do { \
static bool already_logged = false; \
if (!already_logged) { \
already_logged = true; \
BEESLOGNOTICE(__x); \
} \
} while (false)
#define BEESCOUNT(stat) do { \
BeesStats::s_global.add_count(#stat); \
@@ -154,16 +160,16 @@ class BeesStatTmpl {
map<string, T> m_stats_map;
mutable mutex m_mutex;
T& at(string idx);
public:
BeesStatTmpl() = default;
BeesStatTmpl(const BeesStatTmpl &that);
BeesStatTmpl &operator=(const BeesStatTmpl &that);
void add_count(string idx, size_t amount = 1);
T& at(string idx);
T at(string idx) const;
friend ostream& operator<< <>(ostream &os, const BeesStatTmpl<T> &bs);
friend class BeesStats;
friend struct BeesStats;
};
using BeesRates = BeesStatTmpl<double>;
@@ -182,12 +188,16 @@ class BeesBlockData;
class BeesTracer {
function<void()> m_func;
BeesTracer *m_next_tracer = 0;
thread_local static BeesTracer *s_next_tracer;
thread_local static BeesTracer *tl_next_tracer;
thread_local static bool tl_silent;
thread_local static bool tl_first;
public:
BeesTracer(function<void()> f);
BeesTracer(const function<void()> &f, bool silent = false);
~BeesTracer();
static void trace_now();
static bool get_silent();
static void set_silent();
};
class BeesNote {
@@ -199,8 +209,8 @@ class BeesNote {
static mutex s_mutex;
static map<pid_t, BeesNote*> s_status;
thread_local static BeesNote *s_next;
thread_local static string s_name;
thread_local static BeesNote *tl_next;
thread_local static string tl_name;
public:
BeesNote(function<void(ostream &)> f);
@@ -250,15 +260,14 @@ ostream& operator<<(ostream &os, const BeesFileId &bfi);
class BeesFileRange {
protected:
static mutex s_mutex;
mutable Fd m_fd;
Fd m_fd;
mutable BeesFileId m_fid;
off_t m_begin, m_end;
mutable off_t m_file_size;
off_t m_begin = 0, m_end = 0;
mutable off_t m_file_size = -1;
public:
BeesFileRange();
BeesFileRange() = default;
BeesFileRange(Fd fd, off_t begin, off_t end);
BeesFileRange(const BeesFileId &fid, off_t begin, off_t end);
BeesFileRange(const BeesBlockData &bbd);
@@ -273,35 +282,36 @@ public:
bool is_same_file(const BeesFileRange &that) const;
bool overlaps(const BeesFileRange &that) const;
// If file ranges overlap, extends this to include that.
// Coalesce with empty bfr = non-empty bfr
bool coalesce(const BeesFileRange &that);
// Remove that from this, creating 0, 1, or 2 new objects
pair<BeesFileRange, BeesFileRange> subtract(const BeesFileRange &that) const;
off_t begin() const { return m_begin; }
off_t end() const { return m_end; }
off_t size() const;
// Lazy accessors
/// @{ Lazy accessors
off_t file_size() const;
BeesFileId fid() const;
/// @}
// Get the fd if there is one
/// Get the fd if there is one
Fd fd() const;
// Get the fd, opening it if necessary
Fd fd(const shared_ptr<BeesContext> &ctx) const;
/// Get the fd, opening it if necessary
Fd fd(const shared_ptr<BeesContext> &ctx);
/// Copy the BeesFileId but not the Fd
BeesFileRange copy_closed() const;
// Is it defined?
/// Is it defined?
operator bool() const { return !!m_fd || m_fid; }
// Make range larger
/// @{ Make range larger
off_t grow_end(off_t delta);
off_t grow_begin(off_t delta);
/// @}
/// @{ Make range smaller
off_t shrink_end(off_t delta);
off_t shrink_begin(off_t delta);
/// @}
friend ostream & operator<<(ostream &os, const BeesFileRange &bfr);
};
@@ -316,7 +326,6 @@ public:
// Blocks with no physical address (not yet allocated, hole, or "other").
// PREALLOC blocks have a physical address so they're not magic enough to be handled here.
// Compressed blocks have a physical address but it's two-dimensional.
enum MagicValue {
ZERO, // BeesAddress uninitialized
DELALLOC, // delayed allocation
@@ -328,6 +337,7 @@ public:
BeesAddress(Type addr = ZERO) : m_addr(addr) {}
BeesAddress(MagicValue addr) : m_addr(addr) {}
BeesAddress& operator=(const BeesAddress &that) = default;
BeesAddress(const BeesAddress &that) = default;
operator Type() const { return m_addr; }
bool operator==(const BeesAddress &that) const;
bool operator==(const MagicValue that) const { return *this == BeesAddress(that); }
@@ -371,9 +381,11 @@ class BeesStringFile {
size_t m_limit;
public:
BeesStringFile(Fd dir_fd, string name, size_t limit = 1024 * 1024);
BeesStringFile(Fd dir_fd, string name, size_t limit = 16 * 1024 * 1024);
string read();
void write(string contents);
void name(const string &new_name);
string name() const;
};
class BeesHashTable {
@@ -386,6 +398,7 @@ public:
HashType e_hash;
AddrType e_addr;
Cell(const Cell &) = default;
Cell &operator=(const Cell &) = default;
Cell(HashType hash, AddrType addr) : e_hash(hash), e_addr(addr) { }
bool operator==(const Cell &e) const { return tie(e_hash, e_addr) == tie(e.e_hash, e.e_addr); }
bool operator!=(const Cell &e) const { return tie(e_hash, e_addr) != tie(e.e_hash, e.e_addr); }
@@ -407,15 +420,17 @@ public:
uint8_t p_byte[BLOCK_SIZE_HASHTAB_EXTENT];
} __attribute__((packed));
BeesHashTable(shared_ptr<BeesContext> ctx, string filename);
BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t size = BLOCK_SIZE_HASHTAB_EXTENT);
~BeesHashTable();
void stop_request();
void stop_wait();
vector<Cell> find_cell(HashType hash);
bool push_random_hash_addr(HashType hash, AddrType addr);
void erase_hash_addr(HashType hash, AddrType addr);
bool push_front_hash_addr(HashType hash, AddrType addr);
void set_shared(bool shared);
bool flush_dirty_extent(uint64_t extent_index);
private:
string m_filename;
@@ -438,36 +453,52 @@ private:
uint64_t m_buckets;
uint64_t m_extents;
uint64_t m_cells;
set<uint64_t> m_buckets_dirty;
set<uint64_t> m_buckets_missing;
BeesThread m_writeback_thread;
BeesThread m_prefetch_thread;
RateLimiter m_flush_rate_limit;
RateLimiter m_prefetch_rate_limit;
mutex m_extent_mutex;
mutex m_bucket_mutex;
condition_variable m_condvar;
set<HashType> m_toxic_hashes;
BeesStringFile m_stats_file;
LockSet<uint64_t> m_extent_lock_set;
// Prefetch readahead hint
bool m_prefetch_running = false;
DefaultBool m_shared;
// Mutex/condvar for the writeback thread
mutex m_dirty_mutex;
condition_variable m_dirty_condvar;
bool m_dirty = false;
// Mutex/condvar to stop
mutex m_stop_mutex;
condition_variable m_stop_condvar;
bool m_stop_requested = false;
// Per-extent structures
struct ExtentMetaData {
shared_ptr<mutex> m_mutex_ptr; // Access serializer
bool m_dirty = false; // Needs to be written back to disk
bool m_missing = true; // Needs to be read from disk
ExtentMetaData();
};
vector<ExtentMetaData> m_extent_metadata;
void open_file();
void writeback_loop();
void prefetch_loop();
void try_mmap_flags(int flags);
pair<Cell *, Cell *> get_cell_range(HashType hash);
pair<uint8_t *, uint8_t *> get_extent_range(HashType hash);
void fetch_missing_extent(HashType hash);
void set_extent_dirty(HashType hash);
void flush_dirty_extents();
bool is_toxic_hash(HashType h) const;
void fetch_missing_extent_by_hash(HashType hash);
void fetch_missing_extent_by_index(uint64_t extent_index);
void set_extent_dirty_locked(uint64_t extent_index);
size_t flush_dirty_extents(bool slowly);
bool using_shared_map() const { return false; }
size_t hash_to_extent_index(HashType ht);
unique_lock<mutex> lock_extent_by_hash(HashType ht);
unique_lock<mutex> lock_extent_by_index(uint64_t extent_index);
BeesHashTable(const BeesHashTable &) = delete;
BeesHashTable &operator=(const BeesHashTable &) = delete;
static thread_local uniform_int_distribution<size_t> tl_distribution;
};
ostream &operator<<(ostream &os, const BeesHashTable::Cell &bhte);
@@ -487,63 +518,115 @@ class BeesCrawl {
shared_ptr<BeesContext> m_ctx;
mutex m_mutex;
set<BeesFileRange> m_extents;
DefaultBool m_deferred;
BtrfsTreeItem m_next_extent_data;
bool m_deferred = false;
bool m_finished = false;
mutex m_state_mutex;
BeesCrawlState m_state;
ProgressTracker<BeesCrawlState> m_state;
BtrfsTreeObjectFetcher m_btof;
bool fetch_extents();
void fetch_extents_harder();
bool next_transid();
bool restart_crawl_unlocked();
BeesFileRange bti_to_bfr(const BtrfsTreeItem &bti) const;
public:
BeesCrawl(shared_ptr<BeesContext> ctx, BeesCrawlState initial_state);
BeesFileRange peek_front();
BeesFileRange pop_front();
BeesCrawlState get_state();
ProgressTracker<BeesCrawlState>::ProgressHolder hold_state(const BeesCrawlState &bcs);
BeesCrawlState get_state_begin();
BeesCrawlState get_state_end() const;
void set_state(const BeesCrawlState &bcs);
void deferred(bool def_setting);
bool deferred() const;
bool finished() const;
bool restart_crawl();
};
class BeesRoots {
class BeesScanMode;
class BeesRoots : public enable_shared_from_this<BeesRoots> {
shared_ptr<BeesContext> m_ctx;
BeesStringFile m_crawl_state_file;
BeesCrawlState m_crawl_current;
map<uint64_t, shared_ptr<BeesCrawl>> m_root_crawl_map;
using CrawlMap = map<uint64_t, shared_ptr<BeesCrawl>>;
CrawlMap m_root_crawl_map;
mutex m_mutex;
condition_variable m_condvar;
DefaultBool m_crawl_dirty;
uint64_t m_crawl_dirty = 0;
uint64_t m_crawl_clean = 0;
Timer m_crawl_timer;
BeesThread m_crawl_thread;
BeesThread m_writeback_thread;
RateEstimator m_transid_re;
bool m_workaround_btrfs_send = false;
void insert_new_crawl();
void insert_root(const BeesCrawlState &bcs);
shared_ptr<BeesScanMode> m_scanner;
mutex m_tmpfiles_mutex;
map<BeesFileId, Fd> m_tmpfiles;
mutex m_stop_mutex;
condition_variable m_stop_condvar;
bool m_stop_requested = false;
CrawlMap insert_new_crawl();
Fd open_root_nocache(uint64_t root);
Fd open_root_ino_nocache(uint64_t root, uint64_t ino);
uint64_t transid_min();
uint64_t transid_max();
uint64_t transid_max_nocache();
void state_load();
ostream &state_to_stream(ostream &os);
void state_save();
void crawl_roots();
string crawl_state_filename() const;
BeesCrawlState crawl_state_get(uint64_t root);
void crawl_state_set_dirty();
void crawl_state_erase(const BeesCrawlState &bcs);
void crawl_thread();
void writeback_thread();
uint64_t next_root(uint64_t root = 0);
void current_state_set(const BeesCrawlState &bcs);
bool crawl_batch(shared_ptr<BeesCrawl> crawl);
void clear_caches();
shared_ptr<BeesCrawl> insert_root(const BeesCrawlState &bcs);
bool up_to_date(const BeesCrawlState &bcs);
friend class BeesFdCache;
friend class BeesCrawl;
friend class BeesFdCache;
friend class BeesScanMode;
friend class BeesScanModeSubvol;
friend class BeesScanModeExtent;
public:
BeesRoots(shared_ptr<BeesContext> ctx);
void start();
void stop_request();
void stop_wait();
void insert_tmpfile(Fd fd);
void erase_tmpfile(Fd fd);
Fd open_root(uint64_t root);
Fd open_root_ino(uint64_t root, uint64_t ino);
Fd open_root_ino(const BeesFileId &bfi) { return open_root_ino(bfi.root(), bfi.ino()); }
bool is_root_ro(uint64_t root);
enum ScanMode {
SCAN_MODE_LOCKSTEP,
SCAN_MODE_INDEPENDENT,
SCAN_MODE_SEQUENTIAL,
SCAN_MODE_RECENT,
SCAN_MODE_EXTENT,
SCAN_MODE_COUNT, // must be last
};
void set_scan_mode(ScanMode new_mode);
void set_workaround_btrfs_send(bool do_avoid);
uint64_t transid_min();
uint64_t transid_max();
void wait_for_transid(const uint64_t count);
};
struct BeesHash {
@@ -553,15 +636,16 @@ struct BeesHash {
BeesHash(Type that) : m_hash(that) { }
operator Type() const { return m_hash; }
BeesHash& operator=(const Type that) { m_hash = that; return *this; }
BeesHash(const uint8_t *ptr, size_t len);
private:
Type m_hash;
};
ostream & operator<<(ostream &os, const BeesHash &bh);
class BeesBlockData {
using Blob = vector<char>;
using Blob = ByteVector;
mutable Fd m_fd;
off_t m_offset;
@@ -569,7 +653,7 @@ class BeesBlockData {
mutable BeesAddress m_addr;
mutable Blob m_data;
mutable BeesHash m_hash;
mutable DefaultBool m_hash_done;
mutable bool m_hash_done = false;
public:
// Constructor with the immutable fields
@@ -602,135 +686,125 @@ class BeesRangePair : public pair<BeesFileRange, BeesFileRange> {
public:
BeesRangePair(const BeesFileRange &src, const BeesFileRange &dst);
bool grow(shared_ptr<BeesContext> ctx, bool constrained);
void shrink_begin(const off_t delta);
void shrink_end(const off_t delta);
BeesRangePair copy_closed() const;
bool operator<(const BeesRangePair &that) const;
friend ostream & operator<<(ostream &os, const BeesRangePair &brp);
};
class BeesWorkQueueBase {
string m_name;
protected:
static mutex s_mutex;
static set<BeesWorkQueueBase *> s_all_workers;
public:
virtual ~BeesWorkQueueBase();
BeesWorkQueueBase(const string &name);
string name() const;
void name(const string &new_name);
virtual size_t active_size() const = 0;
virtual list<string> peek_active(size_t count) const = 0;
static void for_each_work_queue(function<void(BeesWorkQueueBase *)> f);
};
template <class Task>
class BeesWorkQueue : public BeesWorkQueueBase {
WorkQueue<Task> m_active_queue;
public:
BeesWorkQueue(const string &name);
~BeesWorkQueue();
void push_active(const Task &task, size_t limit);
void push_active(const Task &task);
size_t active_size() const override;
list<string> peek_active(size_t count) const override;
Task pop();
};
class BeesTempFile {
shared_ptr<BeesContext> m_ctx;
shared_ptr<BeesRoots> m_roots;
Fd m_fd;
off_t m_end_offset;
void create();
void realign();
void resize(off_t new_end_offset);
public:
~BeesTempFile();
BeesTempFile(shared_ptr<BeesContext> ctx);
BeesFileRange make_hole(off_t count);
BeesFileRange make_copy(const BeesFileRange &src);
void reset();
};
class BeesFdCache {
LRUCache<Fd, shared_ptr<BeesContext>, uint64_t> m_root_cache;
LRUCache<Fd, shared_ptr<BeesContext>, uint64_t, uint64_t> m_file_cache;
Timer m_root_cache_timer;
shared_ptr<BeesContext> m_ctx;
LRUCache<Fd, uint64_t> m_root_cache;
LRUCache<Fd, uint64_t, uint64_t> m_file_cache;
Timer m_root_cache_timer;
Timer m_file_cache_timer;
public:
BeesFdCache();
Fd open_root(shared_ptr<BeesContext> ctx, uint64_t root);
Fd open_root_ino(shared_ptr<BeesContext> ctx, uint64_t root, uint64_t ino);
void insert_root_ino(shared_ptr<BeesContext> ctx, Fd fd);
BeesFdCache(shared_ptr<BeesContext> ctx);
Fd open_root(uint64_t root);
Fd open_root_ino(uint64_t root, uint64_t ino);
void clear();
};
struct BeesResolveAddrResult {
BeesResolveAddrResult();
vector<BtrfsInodeOffsetRoot> m_biors;
DefaultBool m_is_toxic;
bool m_is_toxic = false;
bool is_toxic() const { return m_is_toxic; }
};
class BeesContext : public enable_shared_from_this<BeesContext> {
shared_ptr<BeesContext> m_parent_ctx;
Fd m_home_fd;
shared_ptr<BeesFdCache> m_fd_cache;
shared_ptr<BeesHashTable> m_hash_table;
shared_ptr<BeesRoots> m_roots;
map<thread::id, shared_ptr<BeesTempFile>> m_tmpfiles;
Pool<BeesTempFile> m_tmpfile_pool;
Pool<BtrfsIoctlLogicalInoArgs> m_logical_ino_pool;
LRUCache<BeesResolveAddrResult, BeesAddress> m_resolve_cache;
string m_root_path;
Fd m_root_fd;
string m_root_uuid;
mutable mutex m_blacklist_mutex;
set<BeesFileId> m_blacklist;
string m_uuid;
Timer m_total_timer;
NamedPtr<Exclusion, uint64_t> m_extent_locks;
NamedPtr<Exclusion, uint64_t> m_inode_locks;
mutable mutex m_stop_mutex;
condition_variable m_stop_condvar;
bool m_stop_requested = false;
bool m_stop_status = false;
shared_ptr<BeesThread> m_progress_thread;
shared_ptr<BeesThread> m_status_thread;
mutex m_progress_mtx;
string m_progress_str;
void set_root_fd(Fd fd);
BeesResolveAddrResult resolve_addr_uncached(BeesAddress addr);
BeesFileRange scan_one_extent(const BeesFileRange &bfr, const Extent &e);
void scan_one_extent(const BeesFileRange &bfr, const Extent &e);
void rewrite_file_range(const BeesFileRange &bfr);
public:
BeesContext(shared_ptr<BeesContext> parent_ctx = nullptr);
void set_root_path(string path);
Fd root_fd() const { return m_root_fd; }
Fd home_fd() const { return m_home_fd; }
Fd home_fd();
string root_path() const { return m_root_path; }
string root_uuid() const { return m_root_uuid; }
BeesFileRange scan_forward(const BeesFileRange &bfr);
bool scan_forward(const BeesFileRange &bfr);
BeesRangePair dup_extent(const BeesFileRange &src);
shared_ptr<BtrfsIoctlLogicalInoArgs> logical_ino(uint64_t bytenr, bool all_refs);
bool is_root_ro(uint64_t root);
BeesRangePair dup_extent(const BeesFileRange &src, const shared_ptr<BeesTempFile> &tmpfile);
bool dedup(const BeesRangePair &brp);
void blacklist_add(const BeesFileId &fid);
void blacklist_insert(const BeesFileId &fid);
void blacklist_erase(const BeesFileId &fid);
bool is_blacklisted(const BeesFileId &fid) const;
shared_ptr<Exclusion> get_inode_mutex(uint64_t inode);
BeesResolveAddrResult resolve_addr(BeesAddress addr);
void invalidate_addr(BeesAddress addr);
void resolve_cache_clear();
void dump_status();
void show_progress();
void set_progress(const string &str);
string get_progress();
void start();
void stop();
bool stop_requested() const;
shared_ptr<BeesFdCache> fd_cache();
shared_ptr<BeesHashTable> hash_table();
@@ -738,9 +812,6 @@ public:
shared_ptr<BeesTempFile> tmpfile();
const Timer &total_timer() const { return m_total_timer; }
// TODO: move the rest of the FD cache methods here
void insert_root_ino(Fd fd);
};
class BeesResolver {
@@ -748,25 +819,25 @@ class BeesResolver {
BeesAddress m_addr;
vector<BtrfsInodeOffsetRoot> m_biors;
set<BeesFileRange> m_ranges;
unsigned m_bior_count;
size_t m_bior_count;
// We found matching data, so we can dedup
DefaultBool m_found_data;
// We found matching data, so we can dedupe
bool m_found_data = false;
// We found matching data, so we *did* dedup
DefaultBool m_found_dup;
// We found matching data, so we *did* dedupe
bool m_found_dup = false;
// We found matching hash, so the hash table is still correct
DefaultBool m_found_hash;
bool m_found_hash = false;
// We found matching physical address, so the hash table isn't totally wrong
DefaultBool m_found_addr;
bool m_found_addr = false;
// We found matching physical address, but data did not match
DefaultBool m_wrong_data;
bool m_wrong_data = false;
// The whole thing is a placebo to avoid crippling btrfs performance bugs
DefaultBool m_is_toxic;
bool m_is_toxic = false;
BeesFileRange chase_extent_ref(const BtrfsInodeOffsetRoot &bior, BeesBlockData &needle_bbd);
BeesBlockData adjust_offset(const BeesFileRange &haystack, const BeesBlockData &needle);
@@ -792,7 +863,7 @@ public:
BeesFileRange find_one_match(BeesHash hash);
void replace_src(const BeesFileRange &src_bfr);
BeesFileRange replace_dst(const BeesFileRange &dst_bfr);
BeesRangePair replace_dst(const BeesFileRange &dst_bfr);
bool found_addr() const { return m_found_addr; }
bool found_data() const { return m_found_data; }
@@ -820,9 +891,16 @@ public:
};
// And now, a giant pile of extern declarations
extern int bees_log_level;
extern const char *BEES_USAGE;
extern const char *BEES_VERSION;
extern thread_local default_random_engine bees_generator;
string pretty(double d);
extern RateLimiter bees_info_rate_limit;
void bees_sync(int fd);
void bees_readahead(int fd, off_t offset, size_t size);
void bees_readahead_pair(int fd, off_t offset, size_t size, int fd2, off_t offset2, size_t size2);
void bees_unreadahead(int fd, off_t offset, size_t size);
void bees_throttle(double time_used, const char *context);
string format_time(time_t t);
bool exception_check();
#endif
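
The header above defines each `BLOCK_MASK_*` constant as the matching block size minus one, "so we don't have to write `(BLOCK_SIZE_CLONE - 1)` everywhere". The sketch below illustrates how such a mask is typically used to align file offsets to the 4096-byte clone block size. The two constants are copied from the header; the helper functions are hypothetical and are not claimed to exist in bees.

```cpp
#include <cassert>
#include <sys/types.h>  // off_t

// Copied from the header shown above.
const off_t BLOCK_SIZE_CLONE = 4096;
const off_t BLOCK_MASK_CLONE = BLOCK_SIZE_CLONE - 1;

// Hypothetical helper: true when offset is a multiple of the clone block size.
bool is_clone_aligned(off_t offset)
{
    return (offset & BLOCK_MASK_CLONE) == 0;
}

// Hypothetical helper: round a non-negative offset down to a block boundary.
off_t align_down_clone(off_t offset)
{
    assert(offset >= 0);
    return offset & ~BLOCK_MASK_CLONE;
}

// Hypothetical helper: round a non-negative offset up to a block boundary.
off_t align_up_clone(off_t offset)
{
    return align_down_clone(offset + BLOCK_MASK_CLONE);
}
```

The mask trick works only because the block size is a power of two; a block size that is not a power of two would need division and multiplication instead.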


@@ -1,52 +0,0 @@
#include "crucible/fd.h"
#include "crucible/fs.h"
#include "crucible/error.h"
#include "crucible/string.h"
#include <iostream>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
using namespace crucible;
using namespace std;
int
main(int argc, char **argv)
{
catch_all([&]() {
THROW_CHECK1(invalid_argument, argc, argc > 1);
string filename = argv[1];
cout << "File: " << filename << endl;
Fd fd = open_or_die(filename, O_RDONLY);
Fiemap fm;
fm.m_max_count = 100;
if (argc > 2) { fm.fm_start = stoull(argv[2], nullptr, 0); }
if (argc > 3) { fm.fm_length = stoull(argv[3], nullptr, 0); }
if (argc > 4) { fm.fm_flags = stoull(argv[4], nullptr, 0); }
fm.fm_length = min(fm.fm_length, FIEMAP_MAX_OFFSET - fm.fm_start);
uint64_t stop_at = fm.fm_start + fm.fm_length;
uint64_t last_byte = fm.fm_start;
do {
fm.do_ioctl(fd);
// cerr << fm;
uint64_t last_logical = FIEMAP_MAX_OFFSET;
for (auto &extent : fm.m_extents) {
if (extent.fe_logical > last_byte) {
cout << "Log " << to_hex(last_byte) << ".." << to_hex(extent.fe_logical) << " Hole" << endl;
}
cout << "Log " << to_hex(extent.fe_logical) << ".." << to_hex(extent.fe_logical + extent.fe_length)
<< " Phy " << to_hex(extent.fe_physical) << ".." << to_hex(extent.fe_physical + extent.fe_length)
<< " Flags " << fiemap_extent_flags_ntoa(extent.fe_flags) << endl;
last_logical = extent.fe_logical + extent.fe_length;
last_byte = last_logical;
}
fm.fm_start = last_logical;
} while (fm.fm_start < stop_at);
});
exit(EXIT_SUCCESS);
}


@@ -1,40 +0,0 @@
#include "crucible/extentwalker.h"
#include "crucible/error.h"
#include "crucible/string.h"
#include <iostream>
#include <fcntl.h>
#include <unistd.h>
using namespace crucible;
using namespace std;
int
main(int argc, char **argv)
{
catch_all([&]() {
THROW_CHECK1(invalid_argument, argc, argc > 1);
string filename = argv[1];
cout << "File: " << filename << endl;
Fd fd = open_or_die(filename, O_RDONLY);
BtrfsExtentWalker ew(fd);
off_t pos = 0;
if (argc > 2) { pos = stoull(argv[2], nullptr, 0); }
ew.seek(pos);
do {
// cout << "\n\n>>>" << ew.current() << "<<<\n\n" << endl;
cout << ew.current() << endl;
} while (ew.next());
#if 0
cout << "\n\n\nAnd now, backwards...\n\n\n" << endl;
do {
cout << "\n\n>>>" << ew.current() << "<<<\n\n" << endl;
} while (ew.prev());
cout << "\n\n\nDone!\n\n\n" << endl;
#endif
});
exit(EXIT_SUCCESS);
}


@@ -1,36 +1,40 @@
PROGRAMS = \
chatter \
crc64 \
execpipe \
fd \
interp \
limits \
namedptr \
path \
process \
progress \
seeker \
table \
task \
all: test
test: $(PROGRAMS)
set -x; for prog in $(PROGRAMS); do ./$$prog || exit 1; done
test: $(PROGRAMS:%=%.txt) Makefile
FORCE:
include ../makeflags
-include ../localconf
LIBS = -lcrucible
LDFLAGS = -L../lib -Wl,-rpath=$(shell realpath ../lib)
LIBS = -lcrucible -lpthread
BEES_LDFLAGS = -L../lib $(LDFLAGS)
depends.mk: *.cc
for x in *.cc; do $(CXX) $(CXXFLAGS) -M "$$x"; done >> depends.mk.new
mv -fv depends.mk.new depends.mk
-include depends.mk
%.dep: %.cc tests.h Makefile
$(CXX) $(BEES_CXXFLAGS) -M -MF $@ -MT $(<:.cc=.o) $<
%.o: %.cc %.h ../makeflags
-echo "Implicit rule %.o: %.cc" >&2
$(CXX) $(CXXFLAGS) -o "$@" -c "$<"
include $(PROGRAMS:%=%.dep)
%: %.o ../makeflags
-echo "Implicit rule %: %.o" >&2
$(CXX) $(CXXFLAGS) -o "$@" "$<" $(LDFLAGS) $(LIBS)
$(PROGRAMS:%=%.o): %.o: %.cc ../makeflags Makefile
$(CXX) $(BEES_CXXFLAGS) -o $@ -c $<
$(PROGRAMS): %: %.o ../makeflags Makefile ../lib/libcrucible.a
$(CXX) $(BEES_CXXFLAGS) $(BEES_LDFLAGS) -o $@ $< $(LIBS)
%.txt: % Makefile FORCE
./$< >$@ 2>&1 || (RC=$$?; cat $@; exit $$RC)
clean:
-rm -fv *.o
rm -fv $(PROGRAMS:%=%.o) $(PROGRAMS:%=%.txt) $(PROGRAMS)

Some files were not shown because too many files have changed in this diff