BTRFS file deduplication with bedup

Bedup is a tool that can scan and deduplicate an existing btrfs filesystem.

In other words, it can find identical files on the filesystem and tell BTRFS they are the same, so you only store one copy of the data rather than N copies (kind of similar to how you could manually use hard links to de-duplicate files).
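To make the "find identical files" part concrete, here is a rough sketch of how you could list duplicate candidates by hand. This is not how bedup itself works internally; it is just my illustration. It assumes GNU find, sha256sum and uniq are available. The function name list_duplicates and its arguments are made up for this example. It only reports duplicates and does not change anything on disk.

```shell
# list_duplicates DIR [MIN_BYTES]
# Hypothetical helper (not part of bedup): prints "hash  path" lines for
# every file under DIR that shares a SHA-256 hash with at least one other
# file. Only files larger than MIN_BYTES bytes are considered (default 0).
list_duplicates() {
    dir=$1
    min_bytes=${2:-0}
    # Hash every candidate file, sort so equal hashes are adjacent, then
    # keep only lines whose first 64 characters (the hash) repeat.
    find "$dir" -type f -size +"${min_bytes}"c -exec sha256sum {} + |
        sort |
        uniq -w64 --all-repeated=separate
}
```

Something like `list_duplicates /test 1024` would then print groups of matching files separated by blank lines. It hashes every file it finds, so on a large filesystem it will be slow.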

Anyway, installation of bedup is straightforward enough (see the project's installation instructions; unfortunately it’s not packaged for Debian).

The only catches I found were that by default it doesn’t try to de-duplicate files smaller than 8MB, and that I needed to either call ‘sync’ or use the --flush argument if I’d only recently copied/created files on disk.

Example usage:

root@box:/test/wordpress# cp wordpress-4.5.2.tar.gz wordpress-4.5.2.tar.gz.copy
root@box:/test/wordpress# ls -l
-rw-r--r-- 1 root root 7770470 May 22 15:12 wordpress-4.5.2.tar.gz
-rw-r--r-- 1 root root 7770470 May 22 15:12 wordpress-4.5.2.tar.gz.copy

root@box:/test/wordpress# /usr/local/bin/bedup dedup /test/ --size-cutoff=1024 --flush
Scanning volume /test/ generations from 473678 to 473681, with size cutoff 1024
00.01 Scanned 81 retained 2
Deduplicating filesystem 
- '/test/wordpress/wordpress-4.5.2.tar.gz'
- '/test/wordpress/wordpress-4.5.2.tar.gz.copy'
00.19 Size group 1/1 (7770470) sampled 2 hashed 2 freed 7770470
00.02 Committing tracking state

Interesting bits :

  • “freed” is in bytes, so we’ve “saved” about 7.4 MiB (7770470 bytes) here.
  • “--size-cutoff=1024” means we’re only going to de-duplicate files larger than 1 KiB.
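If you want to turn the "freed" byte count into something more readable, a small helper like the one below does the arithmetic. The function name bytes_to_mib is my own invention for this example, and it assumes awk is installed.

```shell
# bytes_to_mib BYTES
# Hypothetical helper: prints BYTES as mebibytes (1 MiB = 1048576 bytes),
# rounded to two decimal places.
bytes_to_mib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

# The "freed" value from the bedup run above.
bytes_to_mib 7770470   # prints 7.41
```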

  1. bedup only de-duplicates at the file level, as far as I understand it.

    A BTRFS snapshot would allow for more sharing (parts within a file), but I don’t know what chunk size is used.

