README.md: 32-bit hosts work now

resolve: don't stop at the first physical address lookup failure
The btrfs LOGICAL_INO ioctl has no way to report references to compressed
blocks precisely, so we must always consider all references to a compressed
block, and discard those that do not have the desired offset.

When we encounter compressed shared extents containing a mix of unique
and duplicate data, we attempt to replace all references to the mixed
extent with the same number of references to multiple extents consisting
entirely of unique or duplicate blocks.  An early exit from the loop
in BeesResolver::for_each_extent_ref was stopping this operation early,
after replacing as few as one shared reference.  This left other shared
references to the unique data on the filesystem, effectively creating
new dup data.

The failing pattern looks like this:

	dedup: replace 0x14000..0x18000 from some other extent
	copy: 0x10000..0x14000
	dedup: replace 0x10000..0x14000 with the copy
	[may be multiple dedup lines due to multiple shared references]
	copy: 0x18000..0x1c000
	[missing dedup 0x18000..0x1c000 with the copy here]
	scan: 0x10000 [++++dddd++++] 0x1c000

If the extent 0x10000..0x1c000 is shared and compressed, we will make a
copy of the extent at 0x18000..0x1c000.  When we try to dedup this copy
extent, LOGICAL_INO will return a mix of references to the data at logical
0x10000 and 0x18000 (which are both references to the original shared
extent with different offsets).  If we break out of the loop too early,
we will stop as soon as a reference to 0x10000 is found, and ignore all
other references to the extent we are trying to remove.

The copy at the beginning of the extent (0x10000..0x14000) usually works
because all references to the extent cover the entire extent.  When bees
performs the dedup at 0x14000..0x18000, bees itself creates the shared
references with different offsets.

Uncompressed extents were not affected because LOGICAL_INO can locate
physical blocks precisely if they reside in uncompressed extents.

This change will hurt performance when looking up old physical addresses
that belong to new data, but that is a much less urgent problem.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
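The difference between filtering and stopping can be illustrated with a small standalone sketch. This is not the code in BeesResolver::for_each_extent_ref, which is not shown in this excerpt; `ExtentRef` and `for_each_matching_ref` are hypothetical names, and the only point is that a reference with the wrong offset is skipped with `continue` instead of ending the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for one reference reported for a compressed extent.
// For compressed extents the reported references may point at different
// offsets within the same shared extent.
struct ExtentRef {
    uint64_t inode;
    uint64_t logical; // logical address this reference resolves to
};

// Visit every reference whose logical address equals `wanted` and return
// how many were visited.  Mismatches are skipped with `continue`; replacing
// that `continue` with `break` would reproduce the early exit described in
// the commit message.
inline size_t for_each_matching_ref(const std::vector<ExtentRef> &refs,
                                    uint64_t wanted,
                                    const std::function<void(const ExtentRef &)> &visit)
{
    assert(static_cast<bool>(visit));
    size_t visited = 0;
    for (const auto &ref : refs) {
        if (ref.logical != wanted) {
            continue; // wrong offset: ignore this one, keep looking
        }
        visit(ref);
        ++visited;
    }
    return visited;
}
```

With references at 0x10000, 0x18000, 0x10000, 0x18000 and a wanted address of 0x18000, both matching references are visited even though the first reference in the list does not match.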
2025-08-02 13:53:28 +02:00 · 2016-12-27 18:01:30 -05:00 · 2016-12-27 15:23:40 -05:00 · 2016-12-27 15:20:31 -05:00 · 2016-12-27 15:15:42 -05:00 · 2016-12-27 15:15:42 -05:00
30 changed files with 605 additions and 243 deletions
--- a/README.md
+++ b/README.md
@@ -1,30 +1,52 @@
 BEES
 ====

-Best-Effort Extent-Same, a btrfs deduplication daemon.
+Best-Effort Extent-Same, a btrfs dedup agent.

 About Bees
 ----------

-Bees is a daemon designed to run continuously on live file servers.
-Bees scans and deduplicates whole filesystems in a single pass instead
-of separate scan and dedup phases.  RAM usage does _not_ depend on
-unique data size or the number of input files.  Hash tables and scan
-progress are stored persistently so the daemon can resume after a reboot.
-Bees uses the Linux kernel's `dedupe_file_range` feature to ensure data
-is handled safely even if other applications concurrently modify it.
+Bees is a block-oriented userspace dedup agent designed to avoid
+scalability problems on large filesystems.

-Bees is intentionally btrfs-specific for performance and capability.
-Bees uses the btrfs `SEARCH_V2` ioctl to scan for new data without the
-overhead of repeatedly walking filesystem trees with the POSIX API.
-Bees uses `LOGICAL_INO` and `INO_PATHS` to leverage btrfs's existing
-metadata instead of building its own redundant data structures.
-Bees can cope with Btrfs filesystem compression.  Bees can reassemble
-Btrfs extents to deduplicate extents that contain a mix of duplicate
-and unique data blocks.
+Bees is designed to degrade gracefully when underprovisioned with RAM.
+Bees does not use more RAM or storage as filesystem data size increases.
+The dedup hash table size is fixed at creation time and does not change.
+The effective dedup block size is dynamic and adjusts automatically to
+fit the hash table into the configured RAM limit.  Hash table overflow
+is not implemented to eliminate the IO overhead of hash table overflow.
+Hash table entries are only 16 bytes per dedup block to keep the average
+dedup block size small.

-Bees includes a number of workarounds for Btrfs kernel bugs to (try to)
-avoid ruining your day.  You're welcome.
+Bees does not require alignment between dedup blocks or extent boundaries
+(i.e. it can handle any multiple-of-4K offset between dup block pairs).
+Bees rearranges blocks into shared and unique extents if required to
+work within current btrfs kernel dedup limitations.
+
+Bees can dedup any combination of compressed and uncompressed extents.
+
+Bees operates in a single pass which removes duplicate extents immediately
+during scan.  There are no separate scanning and dedup phases.
+
+Bees uses only data-safe btrfs kernel operations, so it can dedup live
+data (e.g. build servers, sqlite databases, VM disk images).  It does
+not modify file attributes or timestamps.
+
+Bees does not store any information about filesystem structure, so it is
+not affected by the number or size of files (except to the extent that
+these cause performance problems for btrfs in general).  It retrieves such
+information on demand through btrfs SEARCH_V2 and LOGICAL_INO ioctls.
+This eliminates the storage required to maintain the equivalents of
+these functions in userspace.  It's also why bees has no XFS support.
+
+Bees is a daemon designed to run continuously and maintain its state
+across crashes and reboots.  Bees uses checkpoints for persistence to
+eliminate the IO overhead of a transactional data store.  On restart,
+bees will dedup any data that was added to the filesystem since the
+last checkpoint.
+
+Bees is used to dedup filesystems ranging in size from 16GB to 35TB, with
+hash tables ranging in size from 128MB to 11GB.

 How Bees Works
 --------------
@@ -78,11 +100,9 @@ and some metadata bits).  Each entry represents a minimum of 4K on disk.
        1TB                16MB               1024K
       64TB                 1GB               1024K

-It is possible to resize the hash table by changing the size of
-`beeshash.dat` (e.g. with `truncate`) and restarting `bees`.  This
-does not preserve all the existing hash table entries, but it does
-preserve more than zero of them--especially if the old and new sizes
-are a power-of-two multiple of each other.
+To change the size of the hash table, resize `beeshash.dat` with
+`truncate`, delete `beescrawl.dat` so that bees will start over with a
+fresh full-filesystem rescan, and restart `bees`.

 Things You Might Expect That Bees Doesn't Have
 ----------------------------------------------
@@ -129,6 +149,9 @@ blocks, but has no defragmentation capability yet.  When possible, Bees
 will attempt to work with existing extent boundaries, but it will not
 aggregate blocks together from multiple extents to create larger ones.

+* It is possible to resize the hash table without starting over with
+a new full-filesystem scan; however, this has not been implemented yet.
+
 Good Btrfs Feature Interactions
 -------------------------------

@@ -144,19 +167,17 @@ Bees has been tested in combination with the following:
 * IO errors during dedup (read errors will throw exceptions, Bees will catch them and skip over the affected extent)
 * Filesystems mounted *with* the flushoncommit option
 * 4K filesystem data block size / clone alignment
-* 64-bit CPUs (amd64)
+* 64-bit and 32-bit host CPUs (amd64, x86, arm)
 * Large (>16M) extents
 * Huge files (>1TB--although Btrfs performance on such files isn't great in general)
 * filesystems up to 25T bytes, 100M+ files

-
 Bad Btrfs Feature Interactions
 ------------------------------

 Bees has not been tested with the following, and undesirable interactions may occur:

 * Non-4K filesystem data block size (should work if recompiled)
-* 32-bit CPUs (x86, arm)
 * Non-equal hash (SUM) and filesystem data block (CLONE) sizes (probably never will work)
 * btrfs read-only snapshots (never tested, probably wouldn't work well)
 * btrfs send/receive (receive is probably OK, but send requires RO snapshots.  See above)
@@ -200,16 +221,26 @@ Other Caveats
 A Brief List Of Btrfs Kernel Bugs
 ---------------------------------

-Fixed bugs:
+Missing features (usually not available in older LTS kernels):

 * 3.13: `FILE_EXTENT_SAME` ioctl added.  No way to reliably dedup with
  concurrent modifications before this.
 * 3.16: `SEARCH_V2` ioctl added.  Bees could use `SEARCH` instead.
 * 4.2: `FILE_EXTENT_SAME` no longer updates mtime, can be used at EOF.
-  Kernel deadlock bugs fixed.
+
+Bug fixes (sometimes included in older LTS kernels):
+
+* 4.5: hang in the `INO_PATHS` ioctl used by Bees.
+* 4.5: use-after-free in the `FILE_EXTENT_SAME` ioctl used by Bees.
 * 4.7: *slow backref* bug no longer triggers a softlockup panic.  It still
  too long to resolve a block address to a root/inode/offset triple.

+Fixed bugs not yet integrated in mainline Linux:
+
+* 7f8e406 ("btrfs: improve delayed refs iterations"): significantly
+  reduces the CPU time cost of the LOGICAL_INO ioctl (from 30-70% of
+  bees running time to under 5%).
+
 Unfixed kernel bugs (as of 4.5.7) with workarounds in Bees:

 * *slow backref*: If the number of references to a single shared extent
@@ -243,7 +274,7 @@ Unfixed kernel bugs (as of 4.5.7) with workarounds in Bees:
  precisely the specified range of offending fragmented blocks.

 * When writing BeesStringFile, a crash can cause the directory entry
-  `beescrawl.UUID.dat.tmp` to exist without a corresponding inode.
+  `beescrawl.dat.tmp` to exist without a corresponding inode.
  This directory entry cannot be renamed or removed; however, it does
  not prevent the creation of a second directory entry with the same
  name that functions normally, so it doesn't prevent Bees operation.
@@ -251,10 +282,13 @@ Unfixed kernel bugs (as of 4.5.7) with workarounds in Bees:
  The orphan directory entry can be removed by deleting its subvol,
  so place BEESHOME on a separate subvol so you can delete these orphan
  directory entries when they occur (or use btrfs zero-log before mounting
-  the filesystem after a crash).
+  the filesystem after a crash).  Alternatively, place BEESHOME on a
+  non-btrfs filesystem.

-* If the fsync() BeesTempFile::make_copy is removed, the filesystem
-  hangs within a few hours, requiring a reboot to recover.
+* If the `fsync()` in `BeesTempFile::make_copy` is removed, the filesystem
+  hangs within a few hours, requiring a reboot to recover.  On the other
+  hand, there may be net performance benefits to calling `fsync()` before
+  or after each dedup.  This needs further investigation.

 Not really a bug, but a gotcha nonetheless:

@@ -270,9 +304,10 @@ Not really a bug, but a gotcha nonetheless:
 Requirements
 ------------

-* C++11 compiler (tested with GCC 4.9)
+* C++11 compiler (tested with GCC 4.9 and 6.2.0)

-  Sorry.  I really like closures.
+  Sorry.  I really like closures and shared_ptr, so support
+  for earlier compiler versions is unlikely.

 * btrfs-progs (tested with 4.1..4.7)

@@ -284,7 +319,7 @@ Requirements
  TODO: remove the one function used from this library.
  It supports a feature Bees no longer implements.

-* Linux kernel 4.2 or later
+* Linux kernel 4.4.3 or later

  Don't bother trying to make Bees work with older kernels.
  It won't end well.
@@ -320,17 +355,49 @@ of 16M).  This example creates a 1GB hash table:
        truncate -s 1g "$BEESHOME/beeshash.dat"
        chmod 700 "$BEESHOME/beeshash.dat"

+bees can only process the root subvol of a btrfs (seriously--if the
+argument is not the root subvol directory, Bees will just throw an
+exception and stop).
+
+Use a bind mount, and let only bees access it:
+
+	UUID=3399e413-695a-4b0b-9384-1b0ef8f6c4cd
+	mkdir -p /var/lib/bees/$UUID
+	mount /dev/disk/by-uuid/$UUID /var/lib/bees/$UUID -osubvol=/
+
+If you don't set BEESHOME, the path ".beeshome" will be used relative
+to the root subvol of the filesystem.  For example:
+
+	btrfs sub create /var/lib/bees/$UUID/.beeshome
+	truncate -s 1g /var/lib/bees/$UUID/.beeshome/beeshash.dat
+	chmod 700 /var/lib/bees/$UUID/.beeshome/beeshash.dat
+
+You can use any relative path in BEESHOME.  The path will be taken
+relative to the root of the deduped filesystem (in other words it can
+be the name of a subvol):
+
+	export BEESHOME=@my-beeshome
+	btrfs sub create /var/lib/bees/$UUID/$BEESHOME
+	truncate -s 1g /var/lib/bees/$UUID/$BEESHOME/beeshash.dat
+	chmod 700 /var/lib/bees/$UUID/$BEESHOME/beeshash.dat
+
 Configuration
 -------------

 The only runtime configurable options are environment variables:

 * BEESHOME: Directory containing Bees state files:
- * beeshash.dat         | persistent hash table (must be a multiple of 16M)
- * beescrawl.`UUID`.dat | state of SEARCH_V2 crawlers
- * beesstats.txt        | statistics and performance counters
-* BEESSTATS: File containing a snapshot of current Bees state (performance
-  counters and current status of each thread).
+ * beeshash.dat  | persistent hash table.  Must be a multiple of 16M.
+                   This contains 16-byte records:  8 bytes for CRC64,
+                   8 bytes for physical address and some metadata bits.
+ * beescrawl.dat | state of SEARCH_V2 crawlers.  ASCII text.
+ * beesstats.txt | statistics and performance counters.  ASCII text.
+* BEESSTATUS: File containing a snapshot of current Bees state:  performance
+  counters and current status of each thread.  The file is meant to be
+  human readable, but understanding it probably requires reading the source.
+  You can watch bees run in realtime with a command like:
+
+	watch -n1 cat $BEESSTATUS

 Other options (e.g. interval between filesystem crawls) can be configured
 in src/bees.h.
@@ -338,39 +405,27 @@ in src/bees.h.
 Running
 -------

-We created this directory in the previous section:
-
-        export BEESHOME=/some/path
-
-Use a tmpfs for BEESSTATUS, it updates once per second:
-
-        export BEESSTATUS=/run/bees.status
-
-bees can only process the root subvol of a btrfs (seriously--if the
-argument is not the root subvol directory, Bees will just throw an
-exception and stop).
-
-Use a bind mount, and let only bees access it:
-
-        mount -osubvol=/ /dev/<your-filesystem> /var/lib/bees/root
-
-Reduce CPU and IO priority to be kinder to other applications
-sharing this host (or raise them for more aggressive disk space
-recovery).  If you use cgroups, put `bees` in its own cgroup, then reduce
-the `blkio.weight` and `cpu.shares` parameters.  You can also use
-`schedtool` and `ionice` in the shell script that launches `bees`:
+Reduce CPU and IO priority to be kinder to other applications sharing
+this host (or raise them for more aggressive disk space recovery).  If you
+use cgroups, put `bees` in its own cgroup, then reduce the `blkio.weight`
+and `cpu.shares` parameters.  You can also use `schedtool` and `ionice`
+in the shell script that launches `bees`:

        schedtool -D -n20 $$
        ionice -c3 -p $$

 Let the bees fly:

-        bees /var/lib/bees/root >> /var/log/bees.log 2>&1
+	for fs in /var/lib/bees/*-*-*-*-*/; do
+		bees "$fs" >> "$fs/.beeshome/bees.log" 2>&1 &
+	done

 You'll probably want to arrange for /var/log/bees.log to be rotated
 periodically.  You may also want to set umask to 077 to prevent disclosure
 of information about the contents of the filesystem through the log file.

+There are also some shell wrappers in the `scripts/` directory.
+

 Bug Reports and Contributions
 -----------------------------
--- a/include/crucible/cache.h
+++ b/include/crucible/cache.h
@@ -8,6 +8,7 @@
 #include <map>
 #include <mutex>
 #include <tuple>
+#include <vector>

 namespace crucible {
 	using namespace std;
--- a/include/crucible/chatter.h
+++ b/include/crucible/chatter.h
@@ -86,16 +86,6 @@ namespace crucible {
 		}
 	};

-	template <>
-	struct ChatterTraits<ostream &> {
-		Chatter &
-		operator()(Chatter &c, ostream & arg)
-		{
-			c.get_os() << arg;
-			return c;
-		}
-	};
-
 	class ChatterBox {
 		string m_file;
 		int m_line;
--- a/include/crucible/crc64.h
+++ b/include/crucible/crc64.h
@@ -3,11 +3,11 @@

 #include <cstdint>
 #include <cstdlib>
+#include <cstring>

 namespace crucible {
 	namespace Digest {
 		namespace CRC {
-			uint64_t crc64(const char *s);
 			uint64_t crc64(const void *p, size_t len);
 		};
 	};
--- a/include/crucible/fd.h
+++ b/include/crucible/fd.h
@@ -70,10 +70,11 @@ namespace crucible {
 	string mmap_flags_ntoa(int flags);

 	// Unlink, rename
-	void unlink_or_die(const string &file);
 	void rename_or_die(const string &from, const string &to);
 	void renameat_or_die(int fromfd, const string &frompath, int tofd, const string &topath);

+	void ftruncate_or_die(int fd, off_t size);
+
 	// Read or write structs:
 	// There is a template specialization to read or write strings
 	// Three-arg version of read_or_die/write_or_die throws an error on incomplete read/writes
@@ -120,6 +121,9 @@ namespace crucible {
 	template<> void pread_or_die<string>(int fd, string& str, off_t offset);
 	template<> void pread_or_die<vector<char>>(int fd, vector<char>& str, off_t offset);
 	template<> void pread_or_die<vector<uint8_t>>(int fd, vector<uint8_t>& str, off_t offset);
+	template<> void pwrite_or_die<string>(int fd, const string& str, off_t offset);
+	template<> void pwrite_or_die<vector<char>>(int fd, const vector<char>& str, off_t offset);
+	template<> void pwrite_or_die<vector<uint8_t>>(int fd, const vector<uint8_t>& str, off_t offset);

 	// A different approach to reading a simple string
 	string read_string(int fd, size_t size);
--- a/include/crucible/fs.h
+++ b/include/crucible/fs.h
@@ -13,6 +13,7 @@

 #include <cstdint>
 #include <iosfwd>
+#include <set>
 #include <vector>

 #include <fcntl.h>
@@ -150,13 +151,14 @@ namespace crucible {
 		BtrfsIoctlSearchHeader();
 		vector<char> m_data;
 		size_t set_data(const vector<char> &v, size_t offset);
+		bool operator<(const BtrfsIoctlSearchHeader &that) const;
 	};

 	ostream & operator<<(ostream &os, const btrfs_ioctl_search_header &hdr);
 	ostream & operator<<(ostream &os, const BtrfsIoctlSearchHeader &hdr);

 	struct BtrfsIoctlSearchKey : public btrfs_ioctl_search_key {
-		BtrfsIoctlSearchKey(size_t buf_size = 1024 * 1024);
+		BtrfsIoctlSearchKey(size_t buf_size = 4096);
 		virtual bool do_ioctl_nothrow(int fd);
 		virtual void do_ioctl(int fd);

@@ -164,14 +166,15 @@ namespace crucible {
 		void next_min(const BtrfsIoctlSearchHeader& ref);

 		size_t m_buf_size;
-		vector<BtrfsIoctlSearchHeader> m_result;
+		set<BtrfsIoctlSearchHeader> m_result;
+
 	};

 	ostream & operator<<(ostream &os, const btrfs_ioctl_search_key &key);
 	ostream & operator<<(ostream &os, const BtrfsIoctlSearchKey &key);

 	string btrfs_search_type_ntoa(unsigned type);
-	string btrfs_search_objectid_ntoa(unsigned objectid);
+	string btrfs_search_objectid_ntoa(uint64_t objectid);

 	uint64_t btrfs_get_root_id(int fd);
 	uint64_t btrfs_get_root_transid(int fd);
--- a/include/crucible/ntoa.h
+++ b/include/crucible/ntoa.h
@@ -7,12 +7,12 @@ namespace crucible {
 	using namespace std;

 	struct bits_ntoa_table {
-		unsigned long n;
-		unsigned long mask;
+		unsigned long long n;
+		unsigned long long mask;
 		const char *a;
 	};

-	string bits_ntoa(unsigned long n, const bits_ntoa_table *a);
+	string bits_ntoa(unsigned long long n, const bits_ntoa_table *a);

 };
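The widening from `unsigned long` to `unsigned long long` matters on 32-bit hosts, where `unsigned long` is commonly 32 bits while `unsigned long long` is at least 64. The sketch below is only an illustration of that truncation; `fits_in` and `narrow_checked` are hypothetical helpers, and `uint32_t` stands in for a 32-bit `unsigned long` so the behaviour is the same on any host.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Returns true if `value` survives a round trip through the unsigned type T.
template <typename T>
bool fits_in(uint64_t value)
{
    static_assert(std::numeric_limits<T>::is_integer && !std::numeric_limits<T>::is_signed,
                  "T must be an unsigned integer type");
    return static_cast<uint64_t>(static_cast<T>(value)) == value;
}

// Narrow to 32 bits, asserting that no high bits are lost.
// uint32_t models `unsigned long` on a typical 32-bit host (an assumption).
inline uint32_t narrow_checked(uint64_t value)
{
    assert(fits_in<uint32_t>(value));
    return static_cast<uint32_t>(value);
}
```

A 64-bit value such as 0x100000005 does not fit in 32 bits, so passing it through a 32-bit parameter would silently drop the high word; the same value fits in `unsigned long long`.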

--- a/include/crucible/timequeue.h
+++ b/include/crucible/timequeue.h
@@ -23,7 +23,7 @@ namespace crucible {
 	private:
 		struct Item {
 			Timestamp m_time;
-			unsigned m_id;
+			unsigned long m_id;
 			Task m_task;

 			bool operator<(const Item &that) const {
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,8 +18,6 @@ OBJS = \

 include ../makeflags

-LDFLAGS = -shared -luuid
-
 depends.mk: *.c *.cc
 	for x in *.c; do $(CC) $(CFLAGS) -M "$$x"; done > depends.mk.new
 	for x in *.cc; do $(CXX) $(CXXFLAGS) -M "$$x"; done >> depends.mk.new
@@ -34,4 +32,4 @@ depends.mk: *.c *.cc
 	$(CXX) $(CXXFLAGS) -o $@ -c $<

 libcrucible.so: $(OBJS) Makefile
-	$(CXX) $(LDFLAGS) -o $@ $(OBJS)
+	$(CXX) $(LDFLAGS) -o $@ $(OBJS) -shared -luuid
--- a/lib/chatter.cc
+++ b/lib/chatter.cc
@@ -15,7 +15,7 @@
 namespace crucible {
 	using namespace std;

-	static auto_ptr<set<string>> chatter_names;
+	static shared_ptr<set<string>> chatter_names;
 	static const char *SPACETAB = " \t";

 	static
--- a/lib/crc64.cc
+++ b/lib/crc64.cc
@@ -1,3 +1,31 @@
+/* crc64.c -- compute CRC-64
+ * Copyright (C) 2013 Mark Adler
+ * Version 1.4  16 Dec 2013  Mark Adler
+ */
+
+/*
+ This software is provided 'as-is', without any express or implied
+ warranty.  In no event will the author be held liable for any damages
+ arising from the use of this software.
+
+ Permission is granted to anyone to use this software for any purpose,
+ including commercial applications, and to alter it and redistribute it
+ freely, subject to the following restrictions:
+
+ 1. The origin of this software must not be misrepresented; you must not
+ claim that you wrote the original software. If you use this software
+ in a product, an acknowledgment in the product documentation would be
+ appreciated but is not required.
+ 2. Altered source versions must be plainly marked as such, and must not be
+ misrepresented as being the original software.
+ 3. This notice may not be removed or altered from any source distribution.
+
+ Mark Adler
+ madler@alumni.caltech.edu
+ */
+
+/* Substantially modified by Paul Jones for usage in bees */
+
 #include "crucible/crc64.h"

 #define POLY64REV 0xd800000000000000ULL
@@ -5,13 +33,16 @@
 namespace crucible {

 	static bool init = false;
-	static uint64_t CRCTable[256];
+	static uint64_t CRCTable[8][256];

 	static void init_crc64_table()
 	{
 		if (!init) {
-			for (int i = 0; i <= 255; i++) {
-				uint64_t part = i;
+			uint64_t crc;
+
+			// Generate CRCs for all single byte sequences
+			for (int n = 0; n < 256; n++) {
+				uint64_t part = n;
 				for (int j = 0; j < 8; j++) {
 					if (part & 1) {
 						part = (part >> 1) ^ POLY64REV;
@@ -19,37 +50,53 @@ namespace crucible {
 						part >>= 1;
 					}
 				}
-				CRCTable[i] = part;
+				CRCTable[0][n] = part;
+			}
+
+			// Generate nested CRC table for slice-by-8 lookup
+			for (int n = 0; n < 256; n++) {
+				crc = CRCTable[0][n];
+				for (int k = 1; k < 8; k++) {
+					crc = CRCTable[0][crc & 0xff] ^ (crc >> 8);
+					CRCTable[k][n] = crc;
+				}
 			}
 			init = true;
 		}
 	}

-	uint64_t
-	Digest::CRC::crc64(const char *s)
-	{
-		init_crc64_table();
-
-		uint64_t crc = 0;
-		for (; *s; s++) {
-			uint64_t temp1 = crc >> 8;
-			uint64_t temp2 = CRCTable[(crc ^ static_cast<uint64_t>(*s)) & 0xff];
-			crc = temp1 ^ temp2;
-		}
-
-		return crc;
-	}
-
 	uint64_t
 	Digest::CRC::crc64(const void *p, size_t len)
 	{
 		init_crc64_table();
-
+		const unsigned char *next = static_cast<const unsigned char *>(p);
 		uint64_t crc = 0;
-		for (const unsigned char *s = static_cast<const unsigned char *>(p); len; --len) {
-			uint64_t temp1 = crc >> 8;
-			uint64_t temp2 = CRCTable[(crc ^ *s++) & 0xff];
-			crc = temp1 ^ temp2;
+
+		// Process individual bytes until we reach an 8-byte aligned pointer
+		while (len && (reinterpret_cast<uintptr_t>(next) & 7) != 0) {
+			crc = CRCTable[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
+			len--;
+		}
+
+		// Fast middle processing, 8 bytes (aligned!) per loop
+		while (len >= 8) {
+			crc ^= *(reinterpret_cast< const uint64_t *>(next));
+			crc = CRCTable[7][crc & 0xff] ^
+				  CRCTable[6][(crc >> 8) & 0xff] ^
+				  CRCTable[5][(crc >> 16) & 0xff] ^
+				  CRCTable[4][(crc >> 24) & 0xff] ^
+				  CRCTable[3][(crc >> 32) & 0xff] ^
+				  CRCTable[2][(crc >> 40) & 0xff] ^
+				  CRCTable[1][(crc >> 48) & 0xff] ^
+				  CRCTable[0][crc >> 56];
+			next += 8;
+			len -= 8;
+		}
+
+		// Process remaining bytes (fewer than 8 at this point)
+		while (len) {
+			crc = CRCTable[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
+			len--;
 		}

 		return crc;
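The slice-by-8 loop above can be checked against the plain byte-at-a-time loop. The following is a standalone sketch of that check, not the library code: the `sketch` namespace and function names are hypothetical, the polynomial constant is the `POLY64REV` value from the diff, and the 8-byte word is assembled byte by byte so the sketch does not depend on host alignment or byte order (the code in the diff reads an aligned `uint64_t` directly instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {

constexpr uint64_t kPoly = 0xd800000000000000ULL; // POLY64REV in the diff

struct Tables {
    uint64_t t[8][256];
    Tables()
    {
        // t[0][n]: CRC of the single byte n
        for (int n = 0; n < 256; ++n) {
            uint64_t part = static_cast<uint64_t>(n);
            for (int j = 0; j < 8; ++j) {
                part = (part & 1) ? (part >> 1) ^ kPoly : part >> 1;
            }
            t[0][n] = part;
        }
        // t[k][n]: CRC of byte n followed by k zero bytes
        for (int n = 0; n < 256; ++n) {
            uint64_t crc = t[0][n];
            for (int k = 1; k < 8; ++k) {
                crc = t[0][crc & 0xff] ^ (crc >> 8);
                t[k][n] = crc;
            }
        }
    }
};

inline const Tables &tables()
{
    static const Tables tb;
    return tb;
}

// Reference implementation: one table lookup per input byte.
inline uint64_t crc64_bytewise(const void *p, size_t len)
{
    assert(p != nullptr || len == 0);
    const auto &t = tables().t;
    const unsigned char *next = static_cast<const unsigned char *>(p);
    uint64_t crc = 0;
    while (len--) {
        crc = t[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
    }
    return crc;
}

// Slice-by-8: fold 8 input bytes into the CRC, then do 8 lookups at once.
inline uint64_t crc64_slice8(const void *p, size_t len)
{
    assert(p != nullptr || len == 0);
    const auto &t = tables().t;
    const unsigned char *next = static_cast<const unsigned char *>(p);
    uint64_t crc = 0;
    while (len >= 8) {
        uint64_t word = 0;
        for (int i = 0; i < 8; ++i) {
            word |= static_cast<uint64_t>(next[i]) << (8 * i); // first byte lowest
        }
        crc ^= word;
        crc = t[7][crc & 0xff] ^ t[6][(crc >> 8) & 0xff] ^
              t[5][(crc >> 16) & 0xff] ^ t[4][(crc >> 24) & 0xff] ^
              t[3][(crc >> 32) & 0xff] ^ t[2][(crc >> 40) & 0xff] ^
              t[1][(crc >> 48) & 0xff] ^ t[0][crc >> 56];
        next += 8;
        len -= 8;
    }
    while (len--) {
        crc = t[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
    }
    return crc;
}

} // namespace sketch
```

The first input byte of each group uses table 7 because seven more bytes follow it within the group; the last byte uses table 0. Both functions should return the same value for any buffer and length.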
--- a/lib/execpipe.cc
+++ b/lib/execpipe.cc
@@ -72,14 +72,10 @@ namespace crucible {
 			catch_all([&]() {
 				parent_fd->close();
 				import_fd_fn(child_fd);
-				// system("ls -l /proc/$$/fd/ >&2");

 				rv = f();
 			});
 			_exit(rv);
-			cerr << "PID " << getpid() << " TID " << gettid() << "STILL ALIVE" << endl;
-			system("ls -l /proc/$$/task/ >&2");
-			exit(EXIT_FAILURE);
 		}
 	}

--- a/lib/extentwalker.cc
+++ b/lib/extentwalker.cc
@@ -468,7 +468,7 @@ namespace crucible {
 	BtrfsExtentWalker::Vec
 	BtrfsExtentWalker::get_extent_map(off_t pos)
 	{
-		BtrfsIoctlSearchKey sk;
+		BtrfsIoctlSearchKey sk(sc_extent_fetch_max * (sizeof(btrfs_file_extent_item) + sizeof(btrfs_ioctl_search_header)));
 		if (!m_root_fd) {
 			m_root_fd = m_fd;
 		}
--- a/lib/fd.cc
+++ b/lib/fd.cc
@@ -230,6 +230,14 @@ namespace crucible {
 		}
 	}

+	void
+	ftruncate_or_die(int fd, off_t size)
+	{
+		if (::ftruncate(fd, size)) {
+			THROW_ERRNO("ftruncate: " << name_fd(fd) << " size " << size);
+		}
+	}
+
 	string
 	socket_domain_ntoa(int domain)
 	{
@@ -426,6 +434,27 @@ namespace crucible {
 		return pread_or_die(fd, text.data(), text.size(), offset);
 	}

+	template<>
+	void
+	pwrite_or_die<vector<uint8_t>>(int fd, const vector<uint8_t> &text, off_t offset)
+	{
+		return pwrite_or_die(fd, text.data(), text.size(), offset);
+	}
+
+	template<>
+	void
+	pwrite_or_die<vector<char>>(int fd, const vector<char> &text, off_t offset)
+	{
+		return pwrite_or_die(fd, text.data(), text.size(), offset);
+	}
+
+	template<>
+	void
+	pwrite_or_die<string>(int fd, const string &text, off_t offset)
+	{
+		return pwrite_or_die(fd, text.data(), text.size(), offset);
+	}
+
 	Stat::Stat()
 	{
 		memset_zero<stat>(this);
--- a/lib/fs.cc
+++ b/lib/fs.cc
@@ -707,11 +707,19 @@ namespace crucible {
 		return offset + len;
 	}

+	bool
+	BtrfsIoctlSearchHeader::operator<(const BtrfsIoctlSearchHeader &that) const
+	{
+		return tie(objectid, type, offset, len, transid) < tie(that.objectid, that.type, that.offset, that.len, that.transid);
+	}
+
 	bool
 	BtrfsIoctlSearchKey::do_ioctl_nothrow(int fd)
 	{
 		vector<char> ioctl_arg = vector_copy_struct<btrfs_ioctl_search_key>(this);
-		ioctl_arg.resize(sizeof(btrfs_ioctl_search_args_v2) + m_buf_size, 0);
+		// Normally we like to be paranoid and fill empty bytes with zero,
+		// but these buffers can be huge.  80% of a 4GHz CPU huge.
+		ioctl_arg.resize(sizeof(btrfs_ioctl_search_args_v2) + m_buf_size);
 		btrfs_ioctl_search_args_v2 *ioctl_ptr = reinterpret_cast<btrfs_ioctl_search_args_v2 *>(ioctl_arg.data());

 		ioctl_ptr->buf_size = m_buf_size;
@@ -725,13 +733,12 @@ namespace crucible {
 		static_cast<btrfs_ioctl_search_key&>(*this) = ioctl_ptr->key;

 		m_result.clear();
-		m_result.reserve(nr_items);

 		size_t offset = pointer_distance(ioctl_ptr->buf, ioctl_ptr);
 		for (decltype(nr_items) i = 0; i < nr_items; ++i) {
 			BtrfsIoctlSearchHeader item;
 			offset = item.set_data(ioctl_arg, offset);
-			m_result.push_back(item);
+			m_result.insert(item);
 		}

 		return true;
@@ -834,7 +841,7 @@ namespace crucible {
 	}

 	string
-	btrfs_search_objectid_ntoa(unsigned objectid)
+	btrfs_search_objectid_ntoa(uint64_t objectid)
 	{
 		static const bits_ntoa_table table[] = {
 			NTOA_TABLE_ENTRY_ENUM(BTRFS_ROOT_TREE_OBJECTID),
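Switching `m_result` from a `vector` to a `set` relies on the new `operator<`, which compares header fields lexicographically through `tie`. A minimal sketch of the effect, assuming a simplified header type (`SearchHeaderSketch` and `collect` are hypothetical and omit the `m_data` payload of the real struct): results come back sorted by key, and items that compare equal on all five fields are stored once.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <set>
#include <tuple>

// Simplified stand-in for BtrfsIoctlSearchHeader (header fields only).
struct SearchHeaderSketch {
    uint64_t objectid;
    uint32_t type;
    uint64_t offset;
    uint32_t len;
    uint64_t transid;

    // Lexicographic comparison: objectid first, then type, offset, len, transid.
    bool operator<(const SearchHeaderSketch &that) const
    {
        return std::tie(objectid, type, offset, len, transid)
             < std::tie(that.objectid, that.type, that.offset, that.len, that.transid);
    }
};

// Insert items into a set, as the patched loop does with m_result.insert().
inline std::set<SearchHeaderSketch> collect(std::initializer_list<SearchHeaderSketch> items)
{
    std::set<SearchHeaderSketch> result;
    for (const auto &item : items) {
        result.insert(item);
    }
    assert(result.size() <= items.size()); // duplicates collapse, nothing is added
    return result;
}
```

One consequence worth keeping in mind: because the ordering looks only at the header fields, two items with identical headers but different payloads would be treated as the same element.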
--- a/lib/ntoa.cc
+++ b/lib/ntoa.cc
@@ -7,7 +7,7 @@
 namespace crucible {
 	using namespace std;

-	string bits_ntoa(unsigned long n, const bits_ntoa_table *table)
+	string bits_ntoa(unsigned long long n, const bits_ntoa_table *table)
 	{
 		string out;
 		while (n && table->a) {
--- a/makeflags
+++ b/makeflags
@@ -1,4 +1,4 @@
-CCFLAGS  = -Wall -Wextra -Werror -O3 -I../include -ggdb -fpic
+CCFLAGS  = -Wall -Wextra -Werror -O3 -march=native -I../include -ggdb -fpic -D_FILE_OFFSET_BITS=64
 # CCFLAGS  = -Wall -Wextra -Werror -O0 -I../include -ggdb -fpic
 CFLAGS   = $(CCFLAGS) -std=c99
 CXXFLAGS = $(CCFLAGS) -std=c++11 -Wold-style-cast
--- a/scripts/beesd
+++ b/scripts/beesd
@@ -0,0 +1,106 @@
+#!/bin/bash
+# /usr/bin/beesd
+
+## Helpful functions
+INFO(){ echo "INFO:" "$@"; }
+ERRO(){ echo "ERROR:" "$@"; exit 1; }
+YN(){ [[ "$1" =~ (1|Y|y) ]]; }
+
+## Global vars
+export BEESHOME BEESSTATUS
+export WORK_DIR CONFIG_DIR
+export CONFIG_FILE
+export UUID AL16M
+
+readonly AL16M="$((16*1024*1024))"
+readonly CONFIG_DIR=/etc/bees/
+
+## Pre checks
+{
+    [ ! -d "$CONFIG_DIR" ] && ERRO "Missing: $CONFIG_DIR"
+    [ "$UID" == "0" ] || ERRO "Must be run as root"
+}
+
+command -v bees &> /dev/null || ERRO "Missing 'bees' command"
+
+## Parse args
+UUID="$1"
+case "$UUID" in
+    *-*-*-*-*)
+        FILE_CONFIG=""
+        for file in "$CONFIG_DIR"/*.conf; do
+            [ ! -f "$file" ] && continue
+            if grep -q "$UUID" "$file"; then
+                INFO "Find $UUID in $file, use as conf"
+                FILE_CONFIG="$file"
+            fi
+        done
+        [ ! -f "$FILE_CONFIG" ] && ERRO "No config for $UUID"
+        source "$FILE_CONFIG"
+    ;;
+    *)
+        echo "beesd <btrfs_uuid>"
+        exit 1
+    ;;
+esac
+
+WORK_DIR="${WORK_DIR:-/run/bees/}"
+MNT_DIR="${MNT_DIR:-$WORK_DIR/mnt/$UUID}"
+BEESHOME="${BEESHOME:-$MNT_DIR/.beeshome}"
+BEESSTATUS="${BEESSTATUS:-$WORK_DIR/$UUID.status}"
+DB_SIZE="${DB_SIZE:-$((64*AL16M))}"
+LOG_SHORT_PATH="${LOG_SHORT_PATH:-N}"
+
+INFO "Check: BTRFS UUID exists"
+if [ ! -d "/sys/fs/btrfs/$UUID" ]; then
+    ERRO "Can't find BTRFS UUID: $UUID"
+fi
+
+INFO "Check: Disk exists"
+if [ ! -b "/dev/disk/by-uuid/$UUID" ]; then
+    ERRO "Missing disk: /dev/disk/by-uuid/$UUID"
+fi
+
+INFO "WORK DIR: $WORK_DIR"
+mkdir -p "$WORK_DIR" || exit 1
+
+INFO "MOUNT DIR: $MNT_DIR"
+mkdir -p "$MNT_DIR" || exit 1
+
+umount_w(){ mountpoint -q "$1" && umount -l "$1"; }
+force_umount(){ umount_w "$MNT_DIR"; }
+trap force_umount SIGINT SIGTERM EXIT
+
+mount -osubvolid=5 /dev/disk/by-uuid/$UUID "$MNT_DIR" || exit 1
+
+if [ ! -d "$BEESHOME" ]; then
+    INFO "Create subvol $BEESHOME to store bees data"
+    btrfs sub cre "$BEESHOME"
+else
+    btrfs sub show "$BEESHOME" &> /dev/null || ERRO "$BEESHOME MUST BE A SUBVOL!"
+fi
+
+# Check DB size
+{
+    DB_PATH="$BEESHOME/beeshash.dat"
+    touch "$DB_PATH"
+    OLD_SIZE="$(du -b "$DB_PATH" | sed 's/\t/ /g' | cut -d' ' -f1)"
+    NEW_SIZE="$DB_SIZE"
+    if (( "$NEW_SIZE"%AL16M > 0 )); then
+        ERRO "DB_SIZE Must be multiple of 16M"
+    fi
+    if (( "$OLD_SIZE" != "$NEW_SIZE" )); then
+        INFO "Resize db: $OLD_SIZE -> $NEW_SIZE"
+        [ -f "$BEESHOME/beescrawl.$UUID.dat" ] && rm "$BEESHOME/beescrawl.$UUID.dat"
+        truncate -s $NEW_SIZE $DB_PATH
+    fi
+    chmod 700 "$DB_PATH"
+}
+
+if YN "$LOG_SHORT_PATH"; then
+    cd "$MNT_DIR" || exit 1
+    bees .
+else
+    bees "$MNT_DIR"
+fi
+exit 0
--- a/scripts/beesd.conf.sample
+++ b/scripts/beesd.conf.sample
@@ -0,0 +1,31 @@
+## Config for Bees: /etc/bees/beesd.conf.sample
+## https://github.com/Zygo/bees
+## These are the default values; change them if needed
+
+# Which FS will be used
+UUID=5d3c0ad5-bedf-463d-8235-b4d4f6f99476
+
+## System Vars
+# Change carefully
+# WORK_DIR=/run/bees/
+# MNT_DIR="$WORK_DIR/mnt/$UUID"
+# BEESHOME="$MNT_DIR/.beeshome"
+# BEESSTATUS="$WORK_DIR/$UUID.status"
+
+## Make path shorter in logs
+# LOG_SHORT_PATH=N
+
+## Bees DB size
+# Hash Table Sizing
+# sHash table entries are 16 bytes each
+# (64-bit hash, 52-bit block number, and some metadata bits)
+# Each entry represents a minimum of 4K on disk.
+# unique data size    hash table size    average dedup block size
+#     1TB                 4GB                  4K
+#     1TB                 1GB                 16K
+#     1TB               256MB                 64K
+#     1TB                16MB               1024K
+#    64TB                 1GB               1024K
+#
+# Size MUST be a multiple of 16M
+# DB_SIZE=$((64*$AL16M)) # 1G in bytes
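The sizing table in the sample config appears consistent with a simple rule: each 16-byte entry covers one dedup block, so the average block size is the unique data size divided by the number of entries, with a floor of 4K. The sketch below encodes that reading together with the 16M alignment check the `beesd` script performs; the function and constant names are hypothetical, and the formula is inferred from the table rather than taken from the bees sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint64_t kEntryBytes = 16;                // one hash table entry
constexpr uint64_t kMinBlock = 4096;                // each entry covers at least 4K
constexpr uint64_t kAlign = 16ULL * 1024 * 1024;    // table size granularity (16M)

// Mirrors the "multiple of 16M" check done on DB_SIZE.
inline bool valid_table_size(uint64_t table_bytes)
{
    return table_bytes > 0 && table_bytes % kAlign == 0;
}

// Estimated average dedup block size for a given amount of unique data,
// assuming the table is full and every entry covers one block.
inline uint64_t average_dedup_block(uint64_t unique_bytes, uint64_t table_bytes)
{
    assert(valid_table_size(table_bytes));
    const uint64_t entries = table_bytes / kEntryBytes;
    return std::max(kMinBlock, unique_bytes / entries);
}
```

For 1TB of unique data this gives 4K with a 4GB table and 1024K with a 16MB table, matching the first and fourth rows of the table.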
--- a/scripts/beesd@.service
+++ b/scripts/beesd@.service
@@ -0,0 +1,14 @@
+[Unit]
+Description=Bees - Best-Effort Extent-Same, a btrfs deduplicator daemon: %i
+After=local-fs.target
+
+[Service]
+ExecStart=/usr/bin/beesd %i
+Nice=19
+IOSchedulingClass=idle
+CPUAccounting=true
+MemoryAccounting=true
+# CPUQuota=95%
+
+[Install]
+WantedBy=local-fs.target
--- a/src/.gitignore
+++ b/src/.gitignore
@@ -0,0 +1 @@
+bees-version.h
--- a/src/Makefile
+++ b/src/Makefile
@@ -11,6 +11,8 @@ LIBS = -lcrucible -lpthread
 LDFLAGS = -L../lib -Wl,-rpath=$(shell realpath ../lib)

 depends.mk: Makefile *.cc
+	echo "#define BEES_VERSION \"$(shell git describe --always --dirty || echo UNKNOWN)\"" > bees-version.new.h
+	mv -f bees-version.new.h bees-version.h
 	for x in *.cc; do $(CXX) $(CXXFLAGS) -M "$$x"; done > depends.mk.new
 	mv -fv depends.mk.new depends.mk

@@ -36,4 +38,4 @@ BEES_OBJS = \
 	$(CXX) $(CXXFLAGS) -o "$@" $(BEES_OBJS) $(LDFLAGS) $(LIBS)

 clean:
-	-rm -fv *.o
+	-rm -fv *.o bees-version.h
--- a/src/bees-context.cc
+++ b/src/bees-context.cc
@@ -5,6 +5,7 @@

 #include <fstream>
 #include <iostream>
+#include <vector>

 using namespace crucible;
 using namespace std;
@@ -23,10 +24,16 @@ getenv_or_die(const char *name)
 BeesFdCache::BeesFdCache()
 {
 	m_root_cache.func([&](shared_ptr<BeesContext> ctx, uint64_t root) -> Fd {
-		return ctx->roots()->open_root_nocache(root);
+		Timer open_timer;
+		auto rv = ctx->roots()->open_root_nocache(root);
+		BEESCOUNTADD(open_root_ms, open_timer.age() * 1000);
+		return rv;
 	});
 	m_file_cache.func([&](shared_ptr<BeesContext> ctx, uint64_t root, uint64_t ino) -> Fd {
-		return ctx->roots()->open_root_ino_nocache(root, ino);
+		Timer open_timer;
+		auto rv = ctx->roots()->open_root_ino_nocache(root, ino);
+		BEESCOUNTADD(open_ino_ms, open_timer.age() * 1000);
+		return rv;
 	});
 }

@@ -228,15 +235,24 @@ BeesContext::show_progress()
 	}
 }

+Fd
+BeesContext::home_fd()
+{
+	const char *base_dir = getenv("BEESHOME");
+	if (!base_dir) {
+		base_dir = ".beeshome";
+	}
+	m_home_fd = openat(root_fd(), base_dir, FLAGS_OPEN_DIR);
+	if (!m_home_fd) {
+		THROW_ERRNO("openat: " << name_fd(root_fd()) << " / " << base_dir);
+	}
+	return m_home_fd;
+}
+
 BeesContext::BeesContext(shared_ptr<BeesContext> parent) :
 	m_parent_ctx(parent)
 {
-	auto base_dir = getenv_or_die("BEESHOME");
-	BEESLOG("BEESHOME = " << base_dir);
-	m_home_fd = open_or_die(base_dir, FLAGS_OPEN_DIR);
 	if (m_parent_ctx) {
-		m_hash_table = m_parent_ctx->hash_table();
-		m_hash_table->set_shared(true);
 		m_fd_cache = m_parent_ctx->fd_cache();
 	}
 }
--- a/src/bees-hash.cc
+++ b/src/bees-hash.cc
@@ -1,3 +1,4 @@
+#include "bees-version.h"
 #include "bees.h"

 #include "crucible/crc64.h"
@@ -11,13 +12,6 @@
 using namespace crucible;
 using namespace std;

-static inline
-bool
-using_any_madvise()
-{
-	return true;
-}
-
 ostream &
 operator<<(ostream &os, const BeesHash &bh)
 {
@@ -101,8 +95,6 @@ BeesHashTable::get_extent_range(HashType hash)
 void
 BeesHashTable::flush_dirty_extents()
 {
-	if (using_shared_map()) return;
-
 	THROW_CHECK1(runtime_error, m_buckets, m_buckets > 0);

 	unique_lock<mutex> lock(m_extent_mutex);
@@ -124,16 +116,12 @@ BeesHashTable::flush_dirty_extents()
 			uint8_t *dirty_extent_end = m_extent_ptr[extent_number + 1].p_byte;
 			THROW_CHECK1(out_of_range, dirty_extent,     dirty_extent     >= m_byte_ptr);
 			THROW_CHECK1(out_of_range, dirty_extent_end, dirty_extent_end <= m_byte_ptr_end);
-			if (using_shared_map()) {
-				BEESTOOLONG("flush extent " << extent_number);
-				copy(dirty_extent, dirty_extent_end, dirty_extent);
-			} else {
-				BEESTOOLONG("pwrite(fd " << m_fd << " '" << name_fd(m_fd)<< "', length " << to_hex(dirty_extent_end - dirty_extent) << ", offset " << to_hex(dirty_extent - m_byte_ptr) << ")");
-				// Page locks slow us down more than copying the data does
-				vector<uint8_t> extent_copy(dirty_extent, dirty_extent_end);
-				pwrite_or_die(m_fd, extent_copy, dirty_extent - m_byte_ptr);
-				BEESCOUNT(hash_extent_out);
-			}
+			THROW_CHECK2(out_of_range, dirty_extent_end, dirty_extent, dirty_extent_end - dirty_extent == BLOCK_SIZE_HASHTAB_EXTENT);
+			BEESTOOLONG("pwrite(fd " << m_fd << " '" << name_fd(m_fd)<< "', length " << to_hex(dirty_extent_end - dirty_extent) << ", offset " << to_hex(dirty_extent - m_byte_ptr) << ")");
+			// Page locks slow us down more than copying the data does
+			vector<uint8_t> extent_copy(dirty_extent, dirty_extent_end);
+			pwrite_or_die(m_fd, extent_copy, dirty_extent - m_byte_ptr);
+			BEESCOUNT(hash_extent_out);
 		});
 		BEESNOTE("flush rate limited at extent #" << extent_number << " (" << extent_counter << " of " << dirty_extent_copy.size() << ")");
 		m_flush_rate_limit.sleep_for(BLOCK_SIZE_HASHTAB_EXTENT);
@@ -143,7 +131,6 @@ BeesHashTable::flush_dirty_extents()
 void
 BeesHashTable::set_extent_dirty(HashType hash)
 {
-	if (using_shared_map()) return;
 	THROW_CHECK1(runtime_error, m_buckets, m_buckets > 0);
 	auto pr = get_extent_range(hash);
 	uint64_t extent_number = reinterpret_cast<Extent *>(pr.first) - m_extent_ptr;
@@ -156,10 +143,8 @@ BeesHashTable::set_extent_dirty(HashType hash)
 void
 BeesHashTable::writeback_loop()
 {
-	if (!using_shared_map()) {
-		while (1) {
-			flush_dirty_extents();
-		}
+	while (true) {
+		flush_dirty_extents();
 	}
 }

@@ -275,6 +260,7 @@ BeesHashTable::prefetch_loop()

 		graph_blob << "Now:     " << format_time(time(NULL)) << "\n";
 		graph_blob << "Uptime:  " << m_ctx->total_timer().age() << " seconds\n";
+		graph_blob << "Version: " << BEES_VERSION << "\n";

 		graph_blob 
 			<< "\nHash table page occupancy histogram (" << occupied_count << "/" << total_count << " cells occupied, " << (occupied_count * 100 / total_count) << "%)\n" 
@@ -310,7 +296,6 @@ void
 BeesHashTable::fetch_missing_extent(HashType hash)
 {
 	BEESTOOLONG("fetch_missing_extent for hash " << to_hex(hash));
-	if (using_shared_map()) return;
 	THROW_CHECK1(runtime_error, m_buckets, m_buckets > 0);
 	auto pr = get_extent_range(hash);
 	uint64_t extent_number = reinterpret_cast<Extent *>(pr.first) - m_extent_ptr;
@@ -396,7 +381,6 @@ BeesHashTable::find_cell(HashType hash)
 void
 BeesHashTable::erase_hash_addr(HashType hash, AddrType addr)
 {
-	// if (m_shared) return;
 	fetch_missing_extent(hash);
 	BEESTOOLONG("erase hash " << to_hex(hash) << " addr " << addr);
 	unique_lock<mutex> lock(m_bucket_mutex);
@@ -574,12 +558,36 @@ BeesHashTable::try_mmap_flags(int flags)
 }

 void
-BeesHashTable::set_shared(bool shared)
+BeesHashTable::open_file()
 {
-	m_shared = shared;
+	// OK open hash table
+	BEESNOTE("opening hash table '" << m_filename << "' target size " << m_size << " (" << pretty(m_size) << ")");
+
+	// Try to open existing hash table
+	Fd new_fd = openat(m_ctx->home_fd(), m_filename.c_str(), FLAGS_OPEN_FILE_RW, 0700);
+
+	// If that doesn't work, try to make a new one
+	if (!new_fd) {
+		string tmp_filename = m_filename + ".tmp";
+		BEESLOGNOTE("creating new hash table '" << tmp_filename << "'");
+		unlinkat(m_ctx->home_fd(), tmp_filename.c_str(), 0);
+		new_fd = openat_or_die(m_ctx->home_fd(), tmp_filename, FLAGS_CREATE_FILE, 0700);
+		BEESLOGNOTE("truncating new hash table '" << tmp_filename << "' size " << m_size << " (" << pretty(m_size) << ")");
+		ftruncate_or_die(new_fd, m_size);
+		BEESLOGNOTE("renaming new hash table '" << tmp_filename << "' -> '" << m_filename << "'");
+		renameat_or_die(m_ctx->home_fd(), tmp_filename, m_ctx->home_fd(), m_filename);
+	}
+
+	Stat st(new_fd);
+	off_t new_size = st.st_size;
+
+	THROW_CHECK1(invalid_argument, new_size, new_size > 0);
+	THROW_CHECK1(invalid_argument, new_size, (new_size % BLOCK_SIZE_HASHTAB_EXTENT) == 0);
+	m_size = new_size;
+	m_fd = new_fd;
 }

-BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename) :
+BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t size) :
 	m_ctx(ctx),
 	m_size(0),
 	m_void_ptr(nullptr),
@@ -587,35 +595,30 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename) :
 	m_buckets(0),
 	m_cells(0),
 	m_writeback_thread("hash_writeback"),
-	m_prefetch_thread("hash_prefetch " + m_ctx->root_path()),
+	m_prefetch_thread("hash_prefetch"),
 	m_flush_rate_limit(BEES_FLUSH_RATE),
 	m_prefetch_rate_limit(BEES_FLUSH_RATE),
 	m_stats_file(m_ctx->home_fd(), "beesstats.txt")
 {
-	BEESNOTE("opening hash table " << filename);
-
-	m_fd = openat_or_die(m_ctx->home_fd(), filename, FLAGS_OPEN_FILE_RW, 0700);
-	Stat st(m_fd);
-	m_size = st.st_size;
-
-	BEESTRACE("hash table size " << m_size);
-	BEESTRACE("hash table bucket size " << BLOCK_SIZE_HASHTAB_BUCKET);
-	BEESTRACE("hash table extent size " << BLOCK_SIZE_HASHTAB_EXTENT);
-
+	// Sanity checks to protect the implementation from its weaknesses
 	THROW_CHECK2(invalid_argument, BLOCK_SIZE_HASHTAB_BUCKET, BLOCK_SIZE_HASHTAB_EXTENT, (BLOCK_SIZE_HASHTAB_EXTENT % BLOCK_SIZE_HASHTAB_BUCKET) == 0);

-	// Does the union work?
-	THROW_CHECK2(runtime_error, m_void_ptr, m_cell_ptr, m_void_ptr == m_cell_ptr);
-	THROW_CHECK2(runtime_error, m_void_ptr, m_byte_ptr, m_void_ptr == m_byte_ptr);
-	THROW_CHECK2(runtime_error, m_void_ptr, m_bucket_ptr, m_void_ptr == m_bucket_ptr);
-	THROW_CHECK2(runtime_error, m_void_ptr, m_extent_ptr, m_void_ptr == m_extent_ptr);
-
 	// There's more than one union
 	THROW_CHECK2(runtime_error, sizeof(Bucket), BLOCK_SIZE_HASHTAB_BUCKET, BLOCK_SIZE_HASHTAB_BUCKET == sizeof(Bucket));
 	THROW_CHECK2(runtime_error, sizeof(Bucket::p_byte), BLOCK_SIZE_HASHTAB_BUCKET, BLOCK_SIZE_HASHTAB_BUCKET == sizeof(Bucket::p_byte));
 	THROW_CHECK2(runtime_error, sizeof(Extent), BLOCK_SIZE_HASHTAB_EXTENT, BLOCK_SIZE_HASHTAB_EXTENT == sizeof(Extent));
 	THROW_CHECK2(runtime_error, sizeof(Extent::p_byte), BLOCK_SIZE_HASHTAB_EXTENT, BLOCK_SIZE_HASHTAB_EXTENT == sizeof(Extent::p_byte));

+	m_filename = filename;
+	m_size = size;
+	open_file();
+
+	// Now we know size we can compute stuff
+
+	BEESTRACE("hash table size " << m_size);
+	BEESTRACE("hash table bucket size " << BLOCK_SIZE_HASHTAB_BUCKET);
+	BEESTRACE("hash table extent size " << BLOCK_SIZE_HASHTAB_EXTENT);
+
 	BEESLOG("opened hash table filename '" << filename << "' length " << m_size);
 	m_buckets = m_size / BLOCK_SIZE_HASHTAB_BUCKET;
 	m_cells = m_buckets * c_cells_per_bucket;
@@ -624,29 +627,32 @@ BeesHashTable::BeesHashTable(shared_ptr<BeesContext> ctx, string filename) :

 	BEESLOG("\tflush rate limit " << BEES_FLUSH_RATE);

-	if (using_shared_map()) {
-		try_mmap_flags(MAP_SHARED);
-	} else {
-		try_mmap_flags(MAP_PRIVATE | MAP_ANONYMOUS);
-	}
+	// Try to mmap that much memory
+	try_mmap_flags(MAP_PRIVATE | MAP_ANONYMOUS);

 	if (!m_cell_ptr) {
-		THROW_ERROR(runtime_error, "unable to mmap " << filename);
+		THROW_ERRNO("unable to mmap " << filename);
 	}

-	if (!using_shared_map()) {
-		// madvise fails if MAP_SHARED
-		if (using_any_madvise()) {
-			// DONTFORK because we sometimes do fork,
-			// but the child doesn't touch any of the many, many pages
-			BEESTOOLONG("madvise(MADV_HUGEPAGE | MADV_DONTFORK)");
-			DIE_IF_NON_ZERO(madvise(m_byte_ptr, m_size, MADV_HUGEPAGE | MADV_DONTFORK));
-		}
-		for (uint64_t i = 0; i < m_size / sizeof(Extent); ++i) {
-			m_buckets_missing.insert(i);
+	// Do unions work the way we think (and rely on)?
+	THROW_CHECK2(runtime_error, m_void_ptr, m_cell_ptr, m_void_ptr == m_cell_ptr);
+	THROW_CHECK2(runtime_error, m_void_ptr, m_byte_ptr, m_void_ptr == m_byte_ptr);
+	THROW_CHECK2(runtime_error, m_void_ptr, m_bucket_ptr, m_void_ptr == m_bucket_ptr);
+	THROW_CHECK2(runtime_error, m_void_ptr, m_extent_ptr, m_void_ptr == m_extent_ptr);
+
+	{
+		// It's OK if this fails (e.g. kernel not built with CONFIG_TRANSPARENT_HUGEPAGE)
+		// We don't fork any more so DONTFORK isn't really needed
+		BEESTOOLONG("madvise(MADV_HUGEPAGE | MADV_DONTFORK)");
+		if (madvise(m_byte_ptr, m_size, MADV_HUGEPAGE | MADV_DONTFORK)) {
+			BEESLOG("mostly harmless: madvise(MADV_HUGEPAGE | MADV_DONTFORK) failed: " << strerror(errno));
 		}
 	}

+	for (uint64_t i = 0; i < m_size / sizeof(Extent); ++i) {
+		m_buckets_missing.insert(i);
+	}
+
 	m_writeback_thread.exec([&]() {
 		writeback_loop();
        });
--- a/src/bees-resolve.cc
+++ b/src/bees-resolve.cc
@@ -196,7 +196,7 @@ BeesResolver::chase_extent_ref(const BtrfsInodeOffsetRoot &bior, BeesBlockData &

 	Fd file_fd = m_ctx->roots()->open_root_ino(bior.m_root, bior.m_inum);
 	if (!file_fd) {
-		// Delete snapshots generate craptons of these
+		// Deleted snapshots generate craptons of these
 		// BEESINFO("No FD in chase_extent_ref " << bior);
 		BEESCOUNT(chase_no_fd);
 		return BeesFileRange();
@@ -378,7 +378,10 @@ BeesResolver::for_each_extent_ref(BeesBlockData bbd, function<bool(const BeesFil
 				// We have reliable block addresses now, so we guarantee we can hit the desired block.
 				// Failure in chase_extent_ref means we are done, and don't need to look up all the
 				// other references.
-				stop_now = true;
+				// Or...not?  If we have a compressed extent, some refs will not match
+				// if there are two references to the same extent with a reference
+				// to a different extent between them.
+				// stop_now = true;
 			}
 		});

@@ -477,11 +480,6 @@ BeesResolver::find_all_matches(BeesBlockData &bbd)
 bool
 BeesResolver::operator<(const BeesResolver &that) const
 {
-	if (that.m_bior_count < m_bior_count) {
-		return true;
-	} else if (m_bior_count < that.m_bior_count) {
-		return false;
-	}
-	return m_addr < that.m_addr;
+	// Lowest count, highest address
+	return tie(that.m_bior_count, m_addr) < tie(m_bior_count, that.m_addr);
 }
-
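The `tie` expression in `BeesResolver::operator<` places `that.m_bior_count` on the left and `that.m_addr` on the right, so the count is compared in descending order and the address in ascending order, which is the same result as the if/else chain it replaces. A self-contained sketch of this operand-swapping idiom follows; the `Ref` struct is hypothetical and stands in for the two fields only.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for the two fields compared in BeesResolver
struct Ref {
	uint64_t count;
	uint64_t addr;

	// Swapping which object supplies each tuple element reverses the
	// direction of that key: count sorts descending, addr ascending.
	bool operator<(const Ref &that) const
	{
		return std::tie(that.count, addr) < std::tie(count, that.addr);
	}
};
```

With this ordering, `Ref{5, 100}` sorts before `Ref{3, 50}` because its count is higher, and before `Ref{5, 200}` because the counts tie and its address is lower.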
--- a/src/bees-roots.cc
+++ b/src/bees-roots.cc
@@ -42,17 +42,26 @@ BeesCrawlState::BeesCrawlState() :
 bool
 BeesCrawlState::operator<(const BeesCrawlState &that) const
 {
-	return tie(m_root, m_objectid, m_offset, m_min_transid, m_max_transid)
-		< tie(that.m_root, that.m_objectid, that.m_offset, that.m_min_transid, that.m_max_transid);
+	return tie(m_objectid, m_offset, m_root, m_min_transid, m_max_transid)
+		< tie(that.m_objectid, that.m_offset, that.m_root, that.m_min_transid, that.m_max_transid);
 }

 string
 BeesRoots::crawl_state_filename() const
 {
 	string rv;
+
+	// Legacy filename included UUID
 	rv += "beescrawl.";
 	rv += m_ctx->root_uuid();
 	rv += ".dat";
+
+	struct stat buf;
+	if (fstatat(m_ctx->home_fd(), rv.c_str(), &buf, AT_SYMLINK_NOFOLLOW)) {
+		// Use new filename
+		rv = "beescrawl.dat";
+	}
+
 	return rv;
 }

@@ -101,6 +110,12 @@ BeesRoots::state_save()

 	m_crawl_state_file.write(ofs.str());

+	// Renaming things is hard after release
+	if (m_crawl_state_file.name() != "beescrawl.dat") {
+		renameat(m_ctx->home_fd(), m_crawl_state_file.name().c_str(), m_ctx->home_fd(), "beescrawl.dat");
+		m_crawl_state_file.name("beescrawl.dat");
+	}
+
 	BEESNOTE("relocking crawl state");
 	lock.lock();
 	// Not really correct but probably close enough
@@ -193,15 +208,15 @@ BeesRoots::crawl_roots()
 	auto crawl_map_copy = m_root_crawl_map;
 	lock.unlock();

+#if 0
+	// Scan the same inode/offset tuple in each subvol (good for snapshots)
 	BeesFileRange first_range;
 	shared_ptr<BeesCrawl> first_crawl;
 	for (auto i : crawl_map_copy) {
 		auto this_crawl = i.second;
 		auto this_range = this_crawl->peek_front();
 		if (this_range) {
-			auto tuple_this = make_tuple(this_range.fid().ino(), this_range.fid().root(), this_range.begin());
-			auto tuple_first = make_tuple(first_range.fid().ino(), first_range.fid().root(), first_range.begin());
-			if (!first_range || tuple_this < tuple_first) {
+			if (!first_range || this_range < first_range) {
 				first_crawl = this_crawl;
 				first_range = this_range;
 			}
@@ -219,6 +234,27 @@ BeesRoots::crawl_roots()
 		THROW_CHECK2(runtime_error, first_range, first_range_popped, first_range == first_range_popped);
 		return;
 	}
+#else
+	// Scan each subvol one extent at a time (good for continuous forward progress)
+	bool crawled = false;
+	for (auto i : crawl_map_copy) {
+		auto this_crawl = i.second;
+		auto this_range = this_crawl->peek_front();
+		if (this_range) {
+			catch_all([&]() {
+				// BEESINFO("scan_forward " << this_range);
+				m_ctx->scan_forward(this_range);
+			});
+			crawled = true;
+			BEESCOUNT(crawl_scan);
+			m_crawl_current = this_crawl->get_state();
+			auto this_range_popped = this_crawl->pop_front();
+			THROW_CHECK2(runtime_error, this_range, this_range_popped, this_range == this_range_popped);
+		}
+	}
+
+	if (crawled) return;
+#endif

 	BEESLOG("Crawl ran out of data after " << m_crawl_timer.lap() << "s, waiting for more...");
 	BEESCOUNT(crawl_done);
@@ -343,8 +379,8 @@ BeesRoots::state_load()
 BeesRoots::BeesRoots(shared_ptr<BeesContext> ctx) :
 	m_ctx(ctx),
 	m_crawl_state_file(ctx->home_fd(), crawl_state_filename()),
-	m_crawl_thread("crawl " + ctx->root_path()),
-	m_writeback_thread("crawl_writeback " + ctx->root_path())
+	m_crawl_thread("crawl"),
+	m_writeback_thread("crawl_writeback")
 {
 	m_crawl_thread.exec([&]() {
 		catch_all([&]() {
@@ -629,7 +665,7 @@ BeesCrawl::fetch_extents()

 	Timer crawl_timer;

-	BtrfsIoctlSearchKey sk;
+	BtrfsIoctlSearchKey sk(BEES_MAX_CRAWL_SIZE * (sizeof(btrfs_file_extent_item) + sizeof(btrfs_ioctl_search_header)));
 	sk.tree_id = old_state.m_root;
 	sk.min_objectid = old_state.m_objectid;
 	sk.min_type = sk.max_type = BTRFS_EXTENT_DATA_KEY;
@@ -646,7 +682,9 @@ BeesCrawl::fetch_extents()
 	{
 		BEESNOTE("searching crawl sk " << static_cast<btrfs_ioctl_search_key&>(sk));
 		BEESTOOLONG("Searching crawl sk " << static_cast<btrfs_ioctl_search_key&>(sk));
+		Timer crawl_timer;
 		ioctl_ok = sk.do_ioctl_nothrow(m_ctx->root_fd());
+		BEESCOUNTADD(crawl_ms, crawl_timer.age() * 1000);
 	}

 	if (ioctl_ok) {
--- a/src/bees.cc
+++ b/src/bees.cc
@@ -1,3 +1,4 @@
+#include "bees-version.h"
 #include "bees.h"

 #include "crucible/interp.h"
@@ -32,15 +33,12 @@ do_cmd_help(const ArgList &argv)
 		"fs-root-path MUST be the root of a btrfs filesystem tree (id 5).\n"
 		"Other directories will be rejected.\n"
 		"\n"
-		"Multiple filesystems can share a single hash table (BEESHOME)\n"
-		"but this only works well if the content of each filesystem\n"
-		"is distinct from all the others.\n"
-		"\n"
-		"Required environment variables:\n"
-		"\tBEESHOME\tPath to hash table and configuration files\n"
-		"\n"
 		"Optional environment variables:\n"
-		"\tBEESSTATUS\tFile to write status to (tmpfs recommended, e.g. /run)\n"
+		"\tBEESHOME\tPath to hash table and configuration files\n"
+		"\t\t\t(default is .beeshome/ in the root of each filesystem).\n"
+		"\n"
+		"\tBEESSTATUS\tFile to write status to (tmpfs recommended, e.g. /run).\n"
+		"\t\t\tNo status is written if this variable is unset.\n"
 		"\n"
 	<< endl;
 	return 0;
@@ -351,6 +349,18 @@ BeesStringFile::BeesStringFile(Fd dir_fd, string name, size_t limit) :
 	BEESLOG("BeesStringFile " << name_fd(m_dir_fd) << "/" << m_name << " max size " << pretty(m_limit));
 }

+void
+BeesStringFile::name(const string &new_name)
+{
+	m_name = new_name;
+}
+
+string
+BeesStringFile::name() const
+{
+	return m_name;
+}
+
 string
 BeesStringFile::read()
 {
@@ -384,8 +394,13 @@ BeesStringFile::write(string contents)
 		Fd ofd = openat_or_die(m_dir_fd, tmpname, FLAGS_CREATE_FILE, S_IRUSR | S_IWUSR);
 		BEESNOTE("writing " << tmpname << " in " << name_fd(m_dir_fd));
 		write_or_die(ofd, contents);
+#if 0
+		// This triggers too many btrfs bugs.  I wish I was kidding.
+		// Forget snapshots, balance, compression, and dedup:
+		// the system call you have to fear on btrfs is fsync().
 		BEESNOTE("fsyncing " << tmpname << " in " << name_fd(m_dir_fd));
 		DIE_IF_NON_ZERO(fsync(ofd));
+#endif
 	}
 	BEESNOTE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
 	BEESTRACE("renaming " << tmpname << " to " << m_name << " in FD " << name_fd(m_dir_fd));
@@ -489,8 +504,13 @@ BeesTempFile::make_copy(const BeesFileRange &src)

 	THROW_CHECK1(invalid_argument, src, src.size() > 0);

-	// FIXME:  don't know where these come from, but we can't handle them.
-	// Grab a trace for the log.
+	// FIEMAP used to give us garbage data, e.g. distinct adjacent
+	// extents merged into a single entry in the FIEMAP output.
+	// FIEMAP didn't stop giving us garbage data, we just stopped
+	// using FIEMAP.
+	// We shouldn't get absurdly large extents any more; however,
+	// it's still a problem if we do, so bail out and leave a trace
+	// in the log.
 	THROW_CHECK1(invalid_argument, src, src.size() < BLOCK_SIZE_MAX_TEMP_FILE);

 	realign();
@@ -548,7 +568,7 @@ bees_main(ArgList args)
 	list<shared_ptr<BeesContext>> all_contexts;
 	shared_ptr<BeesContext> bc;

-	// Subscribe to fanotify events
+	// Create a context and start crawlers
 	bool did_subscription = false;
 	for (string arg : args) {
 		catch_all([&]() {
@@ -576,6 +596,8 @@ bees_main(ArgList args)
 int
 main(int argc, const char **argv)
 {
+	cerr << "bees version " << BEES_VERSION << endl;
+
 	if (argc < 2) {
 		do_cmd_help(argv);
 		return 2;
--- a/src/bees.h
+++ b/src/bees.h
@@ -136,6 +136,8 @@ const int FLAGS_OPEN_FANOTIFY = O_RDWR | O_NOATIME | O_CLOEXEC | O_LARGEFILE;
 	} \
 } while (0)

+#define BEESLOGNOTE(x) BEESLOG(x); BEESNOTE(x)
+
 #define BEESCOUNT(stat) do { \
 	BeesStats::s_global.add_count(#stat); \
 } while (0)
@@ -374,6 +376,8 @@ public:
 	BeesStringFile(Fd dir_fd, string name, size_t limit = 1024 * 1024);
 	string read();
 	void write(string contents);
+	void name(const string &new_name);
+	string name() const;
 };

 class BeesHashTable {
@@ -407,7 +411,7 @@ public:
 		uint8_t	p_byte[BLOCK_SIZE_HASHTAB_EXTENT];
 	} __attribute__((packed));

-	BeesHashTable(shared_ptr<BeesContext> ctx, string filename);
+	BeesHashTable(shared_ptr<BeesContext> ctx, string filename, off_t size = BLOCK_SIZE_HASHTAB_EXTENT);
 	~BeesHashTable();

 	vector<Cell>	find_cell(HashType hash);
@@ -415,8 +419,6 @@ public:
 	void		erase_hash_addr(HashType hash, AddrType addr);
 	bool		push_front_hash_addr(HashType hash, AddrType addr);

-	void		set_shared(bool shared);
-
 private:
 	string		m_filename;
 	Fd		m_fd;
@@ -452,8 +454,7 @@ private:

 	LockSet<uint64_t> 	m_extent_lock_set;

-	DefaultBool		m_shared;
-
+	void open_file();
 	void writeback_loop();
 	void prefetch_loop();
 	void try_mmap_flags(int flags);
@@ -464,8 +465,6 @@ private:
 	void flush_dirty_extents();
 	bool is_toxic_hash(HashType h) const;

-	bool using_shared_map() const { return false; }
-
 	BeesHashTable(const BeesHashTable &) = delete;
 	BeesHashTable &operator=(const BeesHashTable &) = delete;
 };
@@ -714,7 +713,7 @@ public:
 	void set_root_path(string path);

 	Fd root_fd() const { return m_root_fd; }
-	Fd home_fd() const { return m_home_fd; }
+	Fd home_fd();
 	string root_path() const { return m_root_path; }
 	string root_uuid() const { return m_root_uuid; }

--- a/test/crc64.cc
+++ b/test/crc64.cc
@@ -5,18 +5,6 @@

 using namespace crucible;

-static
-void
-test_getcrc64_strings()
-{
-	assert(Digest::CRC::crc64("John") == 5942451273432301568);
-	assert(Digest::CRC::crc64("Paul") == 5838402100630913024);
-	assert(Digest::CRC::crc64("George") == 6714394476893704192);
-	assert(Digest::CRC::crc64("Ringo") == 6038837226071130112);
-	assert(Digest::CRC::crc64("") == 0);
-	assert(Digest::CRC::crc64("\377\277\300\200") == 15615382887346470912ULL);
-}
-
 static
 void
 test_getcrc64_byte_arrays()
@@ -32,7 +20,6 @@ test_getcrc64_byte_arrays()
 int
 main(int, char**)
 {
-	RUN_A_TEST(test_getcrc64_strings());
 	RUN_A_TEST(test_getcrc64_byte_arrays());

 	exit(EXIT_SUCCESS);
--- a/test/limits.cc
+++ b/test/limits.cc
@@ -141,7 +141,13 @@ test_cast_0x80000000_to_things()
 	SHOULD_FAIL(ranged_cast<unsigned short>(uv));
 	SHOULD_FAIL(ranged_cast<unsigned char>(uv));
 	SHOULD_PASS(ranged_cast<signed long long>(sv), sv);
-	SHOULD_PASS(ranged_cast<signed long>(sv), sv);
+	if (sizeof(long) == 4) {
+		SHOULD_FAIL(ranged_cast<signed long>(sv));
+	} else if (sizeof(long) == 8) {
+		SHOULD_PASS(ranged_cast<signed long>(sv), sv);
+	} else {
+		assert(!"unhandled case, please add code for long here");
+	}
 	SHOULD_FAIL(ranged_cast<signed short>(sv));
 	SHOULD_FAIL(ranged_cast<signed char>(sv));
 	if (sizeof(int) == 4) {
@@ -149,7 +155,7 @@ test_cast_0x80000000_to_things()
 	} else if (sizeof(int) == 8) {
 		SHOULD_PASS(ranged_cast<signed int>(sv), sv);
 	} else {
-		assert(!"unhandled case, please add code here");
+		assert(!"unhandled case, please add code for int here");
 	}
 }

@@ -174,7 +180,13 @@ test_cast_0xffffffff_to_things()
 	SHOULD_FAIL(ranged_cast<unsigned short>(uv));
 	SHOULD_FAIL(ranged_cast<unsigned char>(uv));
 	SHOULD_PASS(ranged_cast<signed long long>(sv), sv);
-	SHOULD_PASS(ranged_cast<signed long>(sv), sv);
+	if (sizeof(long) == 4) {
+		SHOULD_FAIL(ranged_cast<signed long>(sv));
+	} else if (sizeof(long) == 8) {
+		SHOULD_PASS(ranged_cast<signed long>(sv), sv);
+	} else {
+		assert(!"unhandled case, please add code for long here");
+	}
 	SHOULD_FAIL(ranged_cast<signed short>(sv));
 	SHOULD_FAIL(ranged_cast<signed char>(sv));
 	if (sizeof(int) == 4) {
@@ -182,7 +194,7 @@ test_cast_0xffffffff_to_things()
 	} else if (sizeof(int) == 8) {
 		SHOULD_PASS(ranged_cast<signed int>(sv), sv);
 	} else {
-		assert(!"unhandled case, please add code here");
+		assert(!"unhandled case, please add code for int here");
 	}
 }
Author	SHA1	Message	Date
Zygo Blaxell	65a950bc41	README.md: 32-bit hosts work now	2016-12-27 18:01:30 -05:00
Zygo Blaxell	ef8d92a3cb	resolve: don't stop at the first physical address lookup failure The btrfs LOGICAL_INO ioctl has no way to report references to compressed blocks precisely, so we must always consider all references to a compressed block, and discard those that do not have the desired offset. When we encounter compressed shared extents containing a mix of unique and duplicate data, we attempt to replace all references to the mixed extent with the same number of references to multiple extents consisting entirely of unique or duplicate blocks. An early exit from the loop in BeesResolver::for_each_extent_ref was stopping this operation early, after replacing as few as one shared reference. This left other shared references to the unique data on the filesystem, effectively creating new dup data. The failing pattern looks like this: dedup: replace 0x14000..0x18000 from some other extent copy: 0x10000..0x14000 dedup: replace 0x10000..0x14000 with the copy [may be multiple dedup lines due to multiple shared references] copy: 0x18000..0x1c000 [missing dedup 0x18000..0x1c000 with the copy here] scan: 0x10000 [++++dddd++++] 0x1c000 If the extent 0x10000..0x1c000 is shared and compressed, we will make a copy of the extent at 0x18000..0x1c000. When we try to dedup this copy extent, LOGICAL_INO will return a mix of references to the data at logical 0x10000 and 0x18000 (which are both references to the original shared extent with different offsets). If we break out of the loop too early, we will stop as soon as a reference to 0x10000 is found, and ignore all other references to the extent we are trying to remove. The copy at the beginning of the extent (0x10000..0x14000) usually works because all references to the extent cover the entire extent. When bees performs the dedup at 0x14000..0x18000, bees itself creates the shared references with different offsets.
Uncompressed extents were not affected because LOGICAL_INO can locate physical blocks precisely if they reside in uncompressed extents. This change will hurt performance when looking up old physical addresses that belong to new data, but that is a much less urgent problem. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2016-12-27 15:23:40 -05:00
Zygo Blaxell	6e7137f282	bees: work around btrfs fsync bug btrfs provides a flush on rename when the rename target exists, so the fsync is not necessary. In the initialization case (when the rename target does not exist and the implicit flush does not occur), the file may be empty or a hole after a crash. Bees treats this case the same as if the file did not exist. Since this condition occurs for only the first 15 minutes of the lifetime of a bees installation, it's not worth bothering to fix. If we attempt to fsync the file ourselves, on a crash with log replay, btrfs will end up with a directory entry pointing to a non-existent inode. This directory entry cannot be deleted or renamed except by deleting the entire subvol. On large filesystems this bug is triggered by nearly every crash (verified on kernels up to 4.5.7). Remove the fsync to avoid the btrfs bug, and accept the failure mode that occurs in the first 15 minutes after a bees install. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2016-12-27 15:20:31 -05:00
Zygo Blaxell	c1e31004b6	crawl: change scan order to make forward progress at all times Previously, the scan order processed each subvol in order. This required very large amounts of temporary disk space, as a full filesystem scan was required before any shared extents could be deduped. If the hash table RAM was underprovisioned this would mean some shared dup blocks were removed from the hash table before they could be deduped. Currently the scan order takes the first unscanned extent from each subvol. This works well if--and only if--the subvols are either empty or children of a common ancestor. It forces the same inode/offset pairs to be read at close to the same time from each subvol. When a new snapshot is created, this ordering diverts scanning to the new subvol until it catches up to the existing subvols. For large filesystems with frequent snapshot creation this means that the scanner never reaches the end of all subvols. Each new subvol effectively resets the current scan position for the entire filesystem to zero. This prevents bees from ever completing the first filesystem scan. Change the order again, so that we now read one unscanned extent from each subvol in round-robin fashion. When a new subvol is created, we share scan time between old and new subvols. This ensures we eventually finish scanning initial subvols and enter the incremental scanning state. The cost of this change is more repeated reading of shared extents at scan time with less benefit from disk-device-level caching; however, the only way to really fix this problem is to implement scanning on tree 2 (the btrfs extent tree) instead of the subvol trees. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2016-12-27 15:15:42 -05:00
Zygo Blaxell	7ecead1700	doc: comment updates We stopped using FIEMAP for a number of reasons. Document some of them. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2016-12-27 15:15:42 -05:00
Zygo Blaxell	efda609f66	log: remove path from thread name The thread name has an arbitrarily limited size, and we are eventually removing support for multiple paths in a single bees daemon process. Signed-off-by: Zygo Blaxell <bees@furryterror.org>	2016-12-27 15:15:16 -05:00
Zygo Blaxell	abd696c524	build: add -D_FILE_OFFSET_BITS=64 to makeflags to build on 32-bit hosts Also update the tests to insist that off_t be at least 64 bits wide.	2016-12-14 19:02:01 -05:00
Zygo Blaxell	e835e8766e	crucible: use set instead of vector in BtrfsExtentWalker This gets rid of some more big memsets. It may replace them with a lot of tiny mallocs, though. If this turns out to be a bad idea then at least we can easily revert the change.	2016-12-13 21:46:41 -05:00
Zygo Blaxell	7782b79e4b	crucible: reduce buffer size and CPU overhead for BtrfsIoctlSearchKey We really do need some large buffers for BtrfsIoctlSearchKey in some cases, but we don't need to zero them out first. Don't do that so we save some CPU. Reduce the default buffer size to 4K because most BISK users don't need much more than 1K. Set the buffer size explicitly to the product of the number of items and the desired item size in the places that really need a lot of items.	2016-12-13 21:46:35 -05:00
Paul Jones	d7c065e17e	Add native compiler optimizations to compiler flags Signed-off-by: Paul Jones <paul@pauljones.id.au>	2016-12-13 12:53:29 +11:00
Paul Jones	334f5f83ee	Remove unused crc64 function Signed-off-by: Paul Jones <paul@pauljones.id.au>	2016-12-13 12:52:26 +11:00
Paul Jones	8abdeabddc	Make crc64 go faster The current crc64 algorithm is a variant of the Redis implementation. Change it to a variant of the Adler implementation as described at https://matt.sh/redis-crcspeed Test program at https://github.com/PeeJay/crc64-compare Filesize: 1.1G Asking crc64-redis to sum "/media/peejay/BTRFS/1/ubuntu-14.04.5-desktop-amd64.iso"... Asking crc64-adler to sum "/media/peejay/BTRFS/1/ubuntu-14.04.5-desktop-amd64.iso"... Redis CRC-64: f971f9ac6c8ba458 Adler CRC-64: f971f9ac6c8ba458 Adler throughput: 1659.913308 MB/s Redis throughput: 437.284661 MB/s Adler is 3.79x faster than Redis Signed-off-by: Paul Jones <paul@pauljones.id.au>	2016-12-13 12:41:10 +11:00
Zygo Blaxell	f5f4d69ba3	lib: In 2016, Ubuntu still insists on topologically sorted libraries while linking This fixes builds on Ubuntu Server 16.04. Fixes: https://github.com/Zygo/bees/issues/8	2016-12-11 19:53:32 -05:00
Zygo Blaxell	ec9d4a1d15	crucible: fs: use a much smaller default search buffer size It turns out we never use a value for m_buf_size that isn't the default, and we also never ask for more than a few thousand items; however, we do spend a ton of time memsetting the huge buffer to zero. I don't know what the ideal size is, but 16K is a far better guess than 1MB. Let's reduce it for some immediate CPU benefit, and determine what the size should be later. Reported at https://github.com/Zygo/bees/issues/11	2016-12-11 13:24:44 -05:00
Zygo Blaxell	77c11bb90f	bees: add version string and put it in main() and stats file Now that we have more than one bees release it's somewhat important to know which one each bug report is for...	2016-12-08 23:55:59 -05:00
Zygo Blaxell	b5c01c1985	hash: don't throw an exception if MADV_HUGEPAGE fails We don't _need_ transparent hugepages. We like them because they can be faster, but it's not a requirement, and some people will disable transparent hugepages because they make non-Bees-like workloads slow. Try to use MADV_HUGEPAGE, but if it fails, just log the error and continue. MADV_DONTFORK would be useful if we still fork()ed, but we don't currently do that. It's still a useful flag to have because a fork() with more than 50% of RAM in mlocked pages would result in a kernel OOM crash. I don't think it's possible to run Bees on a kernel that does not support the MADV_DONTFORK flag, so don't bother checking for that flag separately.	2016-12-08 23:55:59 -05:00
Zygo Blaxell	d82909387d	README: upgrade kernel requirement to 4.4.3 because of kernel bugs	2016-12-08 23:55:58 -05:00
Zygo Blaxell	1cd6263552	README: document impact of 7f8e406 ("btrfs: improve delayed refs iterations")	2016-12-08 23:55:57 -05:00
Zygo Blaxell	eec80944cd	roots: add a counter for crawl_ms, open_root and open_root_ino Linux kernel commit 7f8e406 ("btrfs: improve delayed refs iterations") seems to dramatically improve LOGICAL_INO performance. Hopefully this commit will find its way into mainline Linux soon. This means that most of the time in Bees is now spent on block reading (50-75%); however, there is still a big gap between block read and the sum of everything else we are measuring with the "*_ms" counters. This gap is about 30% of the run time, so it would be good to find out what's in the gap. Add ms counters around the crawl and open calls to capture where we are spending all the time.	2016-12-08 23:55:39 -05:00
Zygo Blaxell	5a4ff9a0b8	Merge remote-tracking branch 'nefelim4ag/master'	2016-12-02 00:35:51 -05:00
Zygo Blaxell	9506406cff	README: BEESHOME is now relative, UUIDs removed, resizing, file contents	2016-12-02 00:32:32 -05:00
Zygo Blaxell	1c4af5ce5a	main: update usage message BEESHOME is downgraded from required to optional. Don't document the deprecated shared hash table feature.	2016-12-02 00:32:32 -05:00
Zygo Blaxell	642581e89a	hash: remove the experimental shared hash-table and shared mmap features The experiments are over, and the results were not a success. Having two filesystems cohabiting in the same hash table results in a lot of false positives, each of which requires some heavy IO to resolve. Using MAP_SHARED to share a beeshash.dat between processes results in catastrophically bad performance. These features were abandoned long ago, but some of the code--and even worse, its documentation--still remains. Bees wants a hash table false positive rate below 0.1%. With a shared hash table the FP rate is about the same as the dedup rate. Typically duplicate files on one filesystem are duplicate on many filesystems. One or more of Linux VFS and the btrfs mmap(MAP_SHARED) implementation produce extremely poor performance results. A five-order-of-magnitude speedup was achieved by implementing paging in userspace with worker threads. We no longer need the support code for the MAP_SHARED case. It is still possible to run many BeesContexts in a single process, but now the only thing contexts share is the FD cache.	2016-12-02 00:26:02 -05:00
Zygo Blaxell	fdfa78a81b	context: default and relative BEESHOME Allow relative paths with BEESHOME. These paths will be relative to the root of the dedup target filesystem. BEESHOME is now optional. If not specified, '.beeshome' is used. We don't try to create BEESHOME if it doesn't exist. BEESHOME might not be on a btrfs filesystem, so we can't insist it be a subvol.	2016-12-02 00:22:18 -05:00
Zygo Blaxell	6fa8de660b	hash: create beeshash.dat if it does not exist BeesHashTable can now create a beeshash.dat if the file does not already exist. Currently the default size is one hash table extent (16MB) and there's no way to change that (yet), so users should still create their own hash tables for now. The opening of the hash table is deferred (slightly) in preparation for hash table resizing. No doc as the feature is currently unfinished.	2016-12-02 00:20:30 -05:00
Zygo Blaxell	d58de9b76d	bees: introduce BEESLOGNOTE macro Quite often we have the same message in BEESLOG and BEESNOTE, so make a macro to combine them.	2016-12-02 00:20:29 -05:00
Zygo Blaxell	ea0910ee6c	crucible: fd: remove dead reference to unlink_or_die, introduce ftruncate_or_die	2016-12-02 00:19:37 -05:00
Zygo Blaxell	dd21e6f848	crucible: add missing template specializations of pwrite helper functions I got a little too enthusiastic when redacting the code, and removed some overloaded functions bees was using. C++ silently found replacements, and the result was a bug that prevented any data from being persisted from the hash table. Fixes: https://github.com/Zygo/bees/issues/7	2016-12-02 00:16:51 -05:00
Zygo Blaxell	06e111c229	crawl: remove UUID from file names Unfortunately we don't get to remove the libuuid dependency because we still want to read a file that exists in the legacy location.	2016-12-02 00:16:03 -05:00
Timofey Titovets	606d48acc1	Add option to make mnt path shorter in logs Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>	2016-11-28 08:23:50 +03:00
Timofey Titovets	bf4e31ae71	Add default values to vars Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>	2016-11-27 06:23:42 +03:00
Timofey Titovets	03c116c3f1	Add Systemd service for bash wrapper Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>	2016-11-27 03:19:31 +03:00
Timofey Titovets	a384cd976a	Add bash wrapper Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>	2016-11-27 03:19:19 +03:00
Zygo Blaxell	38bb70f5d0	build: OK, maybe 32-bit machines could work I accidentally did a pre-push verification on a 32-bit build host. There were a surprisingly small number of problems, so fix them. Bees now builds on a 32-bit host. Let's not update README just yet, though: the 32-bit ioctl support fails immediately after startup on a 64-bit kernel.	2016-11-26 02:06:28 -05:00
Zygo Blaxell	a57404442c	execpipe: remove unreachable debug code This is tripping up builds in stricter build environments. https://github.com/Zygo/bees/issues/2	2016-11-26 01:06:44 -05:00
Zygo Blaxell	1e621cf4e7	README: Improve "about" section and update compiler dependency "agent" is a nice generic term for the set of things that userspace btrfs deduplicators are. Let's call it that. Throw out the awkward and rambling "About" text and use the announcement from linux-btrfs instead. Terrible English writing I at am.	2016-11-24 23:06:28 -05:00
Zygo Blaxell	1303fb9da8	build: fix FTBFS on GCC 6.2 I'm not surprised that GCC 6 doesn't let me send an ostream ref to itself, even inside an uninstantiated template specialization. I am a little surprised I was trying to, and 4.9 let me get away with it. It's 2016. auto_ptr is deprecated now. Some things were including vector that don't any more. https://github.com/Zygo/bees/issues/1	2016-11-24 22:20:11 -05:00