bees Configuration
==================

The only configuration parameter that *must* be provided is the hash
table size. Other parameters are optional or hardcoded, and the defaults
are reasonable in most cases.

Hash Table Sizing
-----------------

Hash table entries are 16 bytes per data block. The hash table stores the
most recently read unique hashes. Once the hash table is full, each new
entry added to the table evicts an old entry. This makes the hash table
a sliding window over the most recently scanned data from the filesystem.

Here are some numbers to estimate appropriate hash table sizes:

    unique data size |  hash table size |average dedupe extent size
        1TB          |      4GB         |        4K
        1TB          |      1GB         |       16K
        1TB          |    256MB         |       64K
        1TB          |    128MB         |      128K <- recommended
        1TB          |     16MB         |     1024K
       64TB          |      1GB         |     1024K

Notes:

* If the hash table is too large, no extra dedupe efficiency is
  obtained, and the extra space wastes RAM.

* If the hash table is too small, bees extrapolates from matching
  blocks to find matching adjacent blocks in the filesystem that have been
  evicted from the hash table. In other words, bees only needs to find
  one block in common between two extents in order to be able to dedupe
  the entire extents. This provides a significantly higher dedupe hit rate
  per hash table byte than other dedupe tools.

* There is a fairly wide range of usable hash sizes, and performance
  degrades according to a smooth probabilistic curve in both directions.
  Double or half the optimum size usually works just as well.

* When counting unique data in compressed data blocks to estimate
  optimum hash table size, count the *uncompressed* size of the data.

* Another way to approach the hash table size is to simply decide how much
  RAM can be spared without too much discomfort, give bees that amount of
  RAM, and accept whatever dedupe hit rate occurs as a result. bees will
  do the best job it can with the RAM it is given.

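The sizing table above appears to follow from simple arithmetic: one
16-byte entry for each average-sized dedupe extent in the unique data.
The sketch below reproduces the 1TB rows under that assumption. The
function name is hypothetical, and "TB", "GB", etc. are assumed to be
binary units (powers of 1024); none of this is part of bees itself.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

# From the text above: each hash table entry is 16 bytes.
HASH_ENTRY_BYTES = 16


def estimate_hash_table_bytes(unique_data_bytes, avg_dedupe_extent_bytes=128 * KIB):
    """Estimate a hash table size holding one entry per average dedupe extent.

    This is an illustrative helper, not a bees API. The default extent
    size is the row marked "recommended" in the table.
    """
    entries = unique_data_bytes // avg_dedupe_extent_bytes
    return entries * HASH_ENTRY_BYTES


# Print an estimate for each extent size used in the 1TB rows of the table.
for extent in (4 * KIB, 16 * KIB, 64 * KIB, 128 * KIB, 1024 * KIB):
    size = estimate_hash_table_bytes(1 * TIB, extent)
    print(f"1TB unique data, {extent // KIB}K extents: {size // MIB} MB")
```

Since double or half the optimum size usually works just as well, the
result is best treated as a starting point rather than an exact
requirement.
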
Factors affecting optimal hash table size
-----------------------------------------

It is difficult to predict the net effect of data layout and access
patterns on dedupe effectiveness without performing deep inspection of
both the filesystem data and its structure--a task that is as expensive
as performing the deduplication.

* **Compression** in files reduces the average extent length compared
  to uncompressed files. The maximum compressed extent length on
  btrfs is 128KB, while the maximum uncompressed extent length is 128MB.
  Longer extents decrease the optimum hash table size while shorter extents
  increase the optimum hash table size, because the probability of a hash
  table entry being present (i.e. unevicted) in each extent is proportional
  to the extent length.

  As a rule of thumb, the optimal hash table size for a compressed
  filesystem is 2-4x larger than the optimal hash table size for the same
  data on an uncompressed filesystem. Dedupe efficiency falls rapidly with
  hash tables smaller than 128MB/TB, because the average dedupe extent size
  is then larger than the largest possible compressed extent size (128KB).

* **Short writes or fragmentation** also shorten the average extent
  length and increase optimum hash table size. If a database writes to
  files randomly using 4K page writes, all of these extents will be 4K
  in length, and the hash table size must be increased to retain each one
  (or the user must accept a lower dedupe hit rate).

  Defragmenting files that have had many short writes increases the
  extent length and therefore reduces the optimum hash table size.

* **Time between duplicate writes** also affects the optimum hash table
  size. bees reads data blocks in logical order during its first pass,
  and after that new data blocks are read incrementally a few seconds or
  minutes after they are written. bees finds more matching blocks if there
  is a smaller amount of data between the matching reads, i.e. there are
  fewer blocks evicted from the hash table. If most identical writes to
  the filesystem occur near the same time, the optimum hash table size is
  smaller. If most identical writes occur over longer intervals of time,
  the optimum hash table size must be larger to avoid evicting hashes from
  the table before matches are found.

  For example, a build server normally writes out very similar source
  code files over and over, so it will need a smaller hash table than a
  backup server which has to refer to the oldest data on the filesystem
  every time a new client machine's data is added to the server.

Scanning modes
--------------

The `--scan-mode` option affects how bees iterates over the filesystem,
schedules extents for scanning, and tracks progress.

There are now two kinds of scan mode: the legacy **subvol** scan modes,
and the new **extent** scan mode.

Scan mode can be changed by restarting bees with a different scan mode
option.

Extent scan mode:

* Works with 4.15 and later kernels.
* Can estimate progress and provide an ETA.
* Can optimize scanning order to dedupe large extents first.
* Can keep up with frequent creation and deletion of snapshots.

Subvol scan modes:

* Work with 4.14 and earlier kernels.
* Cannot estimate or report progress.
* Cannot optimize scanning order by extent size.
* Have problems keeping up with multiple snapshots created during a scan.

The default scan mode is 4, "extent".

If you are using bees for the first time on a filesystem with many
existing snapshots, you should read about [snapshot gotchas](gotchas.md).

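The kernel version cut-off described above can be checked before picking
a scan mode. The following is a minimal sketch, assuming the kernel
release string has the usual `major.minor...` form; both function names
are hypothetical helpers and not part of bees.

```python
import re


def kernel_supports_extent_scan(release):
    """Return True if a release string such as '5.15.0-91-generic' is 4.15 or later."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major version first, then minor version.
    return (major, minor) >= (4, 15)


def suggest_scan_mode(release):
    """Suggest the extent scan mode where it is usable, else a subvol scan mode."""
    if kernel_supports_extent_scan(release):
        return "extent (mode 4)"
    return "subvol (modes 0-3)"
```

On a running Linux system the release string could be taken from
`platform.release()` in the Python standard library.
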
Subvol scan modes
-----------------

Subvol scan modes are maintained for compatibility with existing
installations, but will not be developed further. New installations
should use extent scan mode instead.

The _quantity_ of text below detailing the shortcomings of each subvol
scan mode should be informative all by itself.

Subvol scan modes work on any kernel version supported by bees. They
are the only scan modes usable on kernel 4.14 and earlier.

The difference between the subvol scan modes is the order in which the
files from different subvols are fed into the scanner. They all scan
files in inode number order, from low to high offset within each inode,
the same way that a program like `cat` would read files (but skipping
over old data from earlier btrfs transactions).

If a filesystem has only one subvolume with data in it, then all of
the subvol scan modes are equivalent. In this case, there is only one
subvolume to scan, so every possible ordering of subvols is the same.

The `--workaround-btrfs-send` option pauses scanning subvols that are
read-only. If the subvol is made read-write (e.g. with `btrfs prop set
$subvol ro false`), or if the `--workaround-btrfs-send` option is removed,
then the scan of that subvol is unpaused and dedupe proceeds normally.
Space will only be recovered when the last read-only subvol is deleted.

Subvol scan modes cannot efficiently or accurately calculate an ETA for
completion or estimate progress through the data. They simply request
"the next new inode" from btrfs, and they are completed when btrfs says
there is no next new inode.

Between subvols, there are several scheduling algorithms with different
trade-offs:

Scan mode 0, "lockstep", scans the same inode number in each subvol at
close to the same time. This is useful if the subvols are snapshots
with a common ancestor, since the same inode number in each subvol will
have similar or identical contents. This maximizes the likelihood that
all of the references to a snapshot of a file are scanned at close to
the same time, improving dedupe hit rate. If the subvols are unrelated
(i.e. not snapshots of a single subvol) then this mode does not provide
any significant advantage. This mode uses smaller amounts of temporary
space for shorter periods of time when most subvols are snapshots. When a
new snapshot is created, this mode will stop scanning other subvols and
scan the new snapshot until the same inode number is reached in each
subvol, which will effectively stop dedupe temporarily as this data has
already been scanned and deduped in the other snapshots.

Scan mode 1, "independent", scans the next inode with new data in
each subvol. There is no coordination between the subvols, other than
round-robin distribution of files from each subvol to each worker thread.
This mode makes continuous forward progress in all subvols. When a new
snapshot is created, previous subvol scans continue as before, but the
worker threads are now divided among one more subvol.

Scan mode 2, "sequential", scans one subvol at a time, in numerical subvol
ID order, processing each subvol completely before proceeding to the next
subvol. This avoids spending time scanning short-lived snapshots that
will be deleted before they can be fully deduped (e.g. those used for
`btrfs send`). Scanning starts on older subvols that are more likely
to be origin subvols for future snapshots, eliminating the need to
dedupe future snapshots separately. This mode uses the largest amount
of temporary space for the longest time, and typically requires a larger
hash table to maintain dedupe hit rate.

Scan mode 3, "recent", scans the subvols with the highest `min_transid`
value first (i.e. the ones that were most recently completely scanned),
then falls back to "independent" mode to break ties. This interrupts
long scans of old subvols to give a rapid dedupe response to new data
in previously scanned subvols, then returns to the old subvols after
the new data is scanned.

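The ordering differences between modes 0, 1 and 2 can be illustrated
with a toy model. This is only a sketch of the visiting order described
above, not the bees scheduler: the subvol IDs, inode numbers and function
names are made up for the example, and worker threads are ignored.
Mode 3 is omitted because it depends on per-subvol `min_transid` state.

```python
from itertools import zip_longest

# Hypothetical example data: subvol ID -> inode numbers containing new data.
subvols = {
    256: [257, 258, 259],
    257: [257, 258, 259],
    258: [257, 260],
}


def lockstep_order(subvols):
    """Mode 0 idea: visit the same inode number in every subvol at about the same time."""
    pairs = [(inode, subvol) for subvol, inodes in subvols.items() for inode in inodes]
    # Sorting by inode number first groups equal inode numbers across subvols.
    return [(subvol, inode) for inode, subvol in sorted(pairs)]


def independent_order(subvols):
    """Mode 1 idea: round-robin, taking the next inode from each subvol in turn."""
    columns = [
        [(subvol, inode) for inode in sorted(inodes)]
        for subvol, inodes in sorted(subvols.items())
    ]
    order = []
    for row in zip_longest(*columns):
        # Subvols that have run out of inodes contribute None; skip those.
        order.extend(item for item in row if item is not None)
    return order


def sequential_order(subvols):
    """Mode 2 idea: finish each subvol, in numerical subvol ID order, before the next."""
    return [(subvol, inode) for subvol in sorted(subvols) for inode in sorted(subvols[subvol])]


for name, func in (
    ("lockstep", lockstep_order),
    ("independent", independent_order),
    ("sequential", sequential_order),
):
    print(name, func(subvols))
```

In this model, lockstep holds back inode 260 of subvol 258 until every
lower inode number has been visited in all subvols, independent takes it
as soon as that subvol's turn comes around, and sequential does not touch
subvol 258 until subvols 256 and 257 are finished.
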
Extent scan mode
----------------

Scan mode 4, "extent", scans the extent tree instead of the subvol trees.
Extent scan mode reads each extent once, regardless of the number of
reflinks or snapshots. It adapts to the creation of new snapshots
and reflinks immediately, without having to revisit old data.

In the extent scan mode, extents are separated into multiple size tiers
to prioritize large extents over small ones. Deduping large extents
keeps the metadata update cost low per block saved, resulting in faster
dedupe at the start of a scan cycle. This is important for maximizing
performance in use cases where bees runs for a limited time, such as
during an overnight maintenance window.

Once the larger size tiers are completed, the rate of space recovery
slows down significantly. It may be desirable to stop bees once
the larger size tiers are finished, then start bees again some time
later after new data has appeared.

Each extent is mapped in physical address order, and all extent references
are submitted to the scanner at the same time, resulting in much better
cache behavior and dedupe performance compared to the subvol scan modes.

The "extent" scan mode is not usable on kernels before 4.15 because
it relies on the `LOGICAL_INO_V2` ioctl added in that kernel release.
When using bees with an older kernel, only subvol scan modes will work.

Extents are divided into virtual subvols by size, using reserved btrfs
subvol IDs 250..255. The size tier groups are:

* 250: 32M+1 and larger
* 251: 8M+1..32M
* 252: 2M+1..8M
* 253: 512K+1..2M
* 254: 128K+1..512K
* 255: 128K and smaller (includes all compressed extents)

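The tier boundaries above can be expressed as a small lookup. This is
a sketch of the mapping as listed, not code taken from bees. The function
name is hypothetical, and K and M are assumed to be binary units.

```python
KIB = 1024
MIB = 1024 * KIB

# (virtual subvol ID, largest extent size in bytes that belongs to the tier),
# ordered from the smallest tier to the second largest.
SIZE_TIERS = [
    (255, 128 * KIB),
    (254, 512 * KIB),
    (253, 2 * MIB),
    (252, 8 * MIB),
    (251, 32 * MIB),
]
LARGEST_TIER = 250  # 32M+1 and larger


def size_tier(extent_bytes):
    """Return the virtual subvol ID (250..255) for an extent of the given size.

    Compressed extents are at most 128K long, so they always land in tier 255.
    """
    if extent_bytes <= 0:
        raise ValueError("extent size must be positive")
    for subvol_id, upper_bound in SIZE_TIERS:
        if extent_bytes <= upper_bound:
            return subvol_id
    return LARGEST_TIER
```

For example, `size_tier(4 * KIB)` falls in tier 255, while an extent of
`32 * MIB + 1` bytes falls in tier 250.
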
Extent scan mode can efficiently calculate dedupe progress within
the filesystem and estimate an ETA for completion within each size
tier; however, the accuracy of the ETA can be questionable due to the
non-uniform distribution of block addresses in a typical user filesystem.

Older versions of bees do not recognize the virtual subvols, so running
an old bees version after running a new bees version will reset the
"extent" scan mode's progress in `beescrawl.dat` to the beginning.
This may change in future bees releases, i.e. extent scans will store
their checkpoint data somewhere else.

The `--workaround-btrfs-send` option behaves differently in extent
scan mode: dedupe proceeds on all subvols that are read-write, but all
subvols that are read-only are excluded from dedupe.
Space will only be recovered when the last read-only subvol is deleted.

During `btrfs send`, duplicate extents in the sent subvol will not be
removed (the kernel will reject dedupe commands while send is active,
and bees currently will not re-issue them after the send is complete).
It may be preferable to terminate the bees process while running `btrfs
send` in extent scan mode, and restart bees after the `send` is complete.

Threads and load management
---------------------------

By default, bees creates one worker thread for each CPU detected. These
threads then perform scanning and dedupe operations. bees attempts to
maximize the amount of productive work each thread does, until either the
threads are all continuously busy, or there is no remaining work to do.

In many cases it is not desirable to continually run bees at maximum
performance. Maximum performance is not necessary if bees can dedupe
new data faster than it appears on the filesystem. If it only takes
bees 10 minutes per day to dedupe all new data on a filesystem, then
bees doesn't need to run for more than 10 minutes per day.

bees supports a number of options for reducing system load:

* Run bees for a few hours per day, at an off-peak time (i.e. during
  a maintenance window), instead of running bees continuously. Any data
  added to the filesystem while bees is not running will be scanned when
  bees restarts. At the end of the maintenance window, terminate the
  bees process with SIGTERM to write the hash table and scan position
  for the next maintenance window.

* Temporarily pause bees operation by sending the bees process SIGUSR1,
  and resume operation with SIGUSR2. This is preferable to freezing
  and thawing the process, e.g. with freezer cgroups or SIGSTOP/SIGCONT
  signals, because it allows bees to close open file handles that would
  otherwise prevent those files from being deleted while bees is frozen.

* Reduce the number of worker threads with the [`--thread-count` or
  `--thread-factor` options](options.md). This simply leaves CPU cores
  idle, either so that other applications on the host can use them or
  to save power.

* Allow bees to automatically track system load and increase or decrease
  the number of threads to reach a target system load. This reduces
  impact on the rest of the system by pausing bees when other CPU and IO
  intensive loads are active on the system, and resumes bees when the other
  loads are inactive. This is configured with the [`--loadavg-target`
  and `--thread-min` options](options.md).

* Allow bees to self-throttle operations that enqueue delayed work
  within btrfs. These operations are not well controlled by Linux
  features such as process priority or IO priority or IO rate-limiting,
  because the enqueued work is submitted to btrfs several seconds before
  btrfs performs the work. By the time btrfs performs the work, it's too
  late for external throttling to be effective. The [`--throttle-factor`
  option](options.md) tracks how long it takes btrfs to complete queued
  operations, and reduces bees's queued work submission rate to match
  btrfs's queued work completion rate (or a fraction thereof, to reduce
  system load).

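The signal-based options in the list above can be driven from a small
wrapper script. The sketch below only sends the signals named in the
text. The function names are hypothetical, and finding the process ID of
the running bees process is left to the caller.

```python
import os
import signal


def pause_bees(pid):
    """Ask a running bees process to pause its work (SIGUSR1)."""
    os.kill(pid, signal.SIGUSR1)


def resume_bees(pid):
    """Ask a paused bees process to resume its work (SIGUSR2)."""
    os.kill(pid, signal.SIGUSR2)


def stop_bees(pid):
    """End a maintenance window: SIGTERM lets bees write its hash table and scan position."""
    os.kill(pid, signal.SIGTERM)
```

A scheduler such as cron could start bees at the beginning of the
maintenance window and call `stop_bees` at the end, while `pause_bees`
and `resume_bees` suit shorter interruptions within a window.
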
Log verbosity
-------------

bees can be made less chatty with the [`--verbose` option](options.md).