Bedup is a tool that can scan and deduplicate an existing btrfs filesystem.
In other words, it can find identical files on the filesystem and tell BTRFS they are the same – so you only store one copy of the data rather than N copies (kind of similar to how you could manually use hard links to de-duplicate files).
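To make the idea concrete, here is a minimal sketch of how identical files can be found: group files by size first, then hash only the files that share a size. This is not bedup's actual code, only an illustration of the general technique; the names find_duplicates, file_digest and the size_cutoff parameter are made up for this example, and it only reports duplicates rather than asking the filesystem to share their data.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, size_cutoff=1024):
    """Return lists of paths under root whose contents are identical.

    size_cutoff is a hypothetical parameter for this sketch: files
    smaller than this many bytes are ignored.
    """
    # Pass 1: bucket regular files by size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= size_cutoff:
                by_size[size].append(path)

    # Pass 2: hash only files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling find_duplicates("/test") would return something like a list containing one group for the two WordPress tarballs used later in this post.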
Anyway, installation of bedup is straightforward enough (see its README.md file; unfortunately it’s not packaged for Debian).
The only catches I found were that by default it doesn’t try to de-duplicate files smaller than 8MB, and that I needed to either call ‘sync’ or use the --flush argument if I’d only recently copied/created files on disk.
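If the copying is done from a script, the ‘sync’ step can be done there too. A minimal sketch, assuming Python 3 on Linux: copy_and_sync is a made-up helper name, and os.sync() is the standard library equivalent of running the ‘sync’ command.

```python
import os
import shutil


def copy_and_sync(src, dst):
    """Copy src to dst, then ask the kernel to flush pending writes.

    Hypothetical helper: the flush is the same thing the 'sync'
    command does, so a scan run afterwards sees the new file.
    """
    shutil.copyfile(src, dst)
    os.sync()  # flush all buffered filesystem writes to disk
    return dst
```

After something like copy_and_sync("wordpress-4.5.2.tar.gz", "wordpress-4.5.2.tar.gz.copy"), the dedup run should not need the extra flush.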
Random example of usage :
root@box:/test/wordpress# cp wordpress-4.5.2.tar.gz wordpress-4.5.2.tar.gz.copy
root@box:/test/wordpress# ls -l
....
-rw-r--r-- 1 root root 7770470 May 22 15:12 wordpress-4.5.2.tar.gz
-rw-r--r-- 1 root root 7770470 May 22 15:12 wordpress-4.5.2.tar.gz.copy
root@box:/test/wordpress# /usr/local/bin/bedup dedup /test/ --sizecutoff=1024 --flush
...
Scanning volume /test/ generations from 473678 to 473681, with size cutoff 1024
00.01 Scanned 81 retained 2
Deduplicating filesystem
Deduplicated:
- '/test/wordpress/wordpress-4.5.2.tar.gz'
- '/test/wordpress/wordpress-4.5.2.tar.gz.copy'
00.19 Size group 1/1 (7770470) sampled 2 hashed 2 freed 7770470
00.02 Committing tracking state
Interesting bits :
- “freed” is in bytes, so we’ve “saved” about 7.4 MiB (7,770,470 bytes) here.
- “--size-cutoff=1024” means we’re only going to de-duplicate files larger than 1KB (1024 bytes).
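For anyone wanting to check the arithmetic, here is a small sketch that pulls the “freed” figure out of output like the above and converts it. The function names are made up, and it assumes the “freed N” wording stays as shown in this one run; I haven’t checked whether the output format is stable between bedup versions.

```python
import re


def freed_bytes(output):
    """Sum every 'freed N' figure found in bedup-style output text."""
    return sum(int(n) for n in re.findall(r"\bfreed (\d+)\b", output))


def describe(n_bytes):
    """Format a byte count in binary (MiB) and decimal (MB) units."""
    mib = n_bytes / 2**20
    mb = n_bytes / 10**6
    return f"{n_bytes} bytes = {mib:.2f} MiB = {mb:.2f} MB"


line = "00.19 Size group 1/1 (7770470) sampled 2 hashed 2 freed 7770470"
print(describe(freed_bytes(line)))
```

For the run above this gives 7.41 MiB, or 7.77 MB in decimal units.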
Does it also dedup individual blocks?
That would be nice really for VM images.
bedup only works at the file level, as far as I understand it.
A BTRFS snapshot would allow for more sharing (parts within a file), but I don’t know what chunk size is used.
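To get a feel for how much two VM images might share below the file level, something like the following sketch could be used. It only measures overlap and does not deduplicate anything; the 4096-byte block size is purely an assumption for illustration (not a statement about what BTRFS uses), and block_hashes / shared_blocks are made-up names.

```python
import hashlib


def block_hashes(path, block_size=4096):
    """Return the SHA-256 digest of each fixed-size block of a file."""
    hashes = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            hashes.append(hashlib.sha256(block).digest())
    return hashes


def shared_blocks(path_a, path_b, block_size=4096):
    """Count blocks of path_b whose content also appears in path_a.

    block_size is an assumed value chosen for this example only.
    """
    seen = set(block_hashes(path_a, block_size))
    return sum(1 for h in block_hashes(path_b, block_size) if h in seen)
```

Two images built from the same base install would be expected to score high here, even though a whole-file comparison sees them as completely different.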