btrfs & ext4 – error handling when the hardware fails …

I have a mini PC (old intel NUC) I use for taking backups of my desktop. It has a single 4TiB ssd in it.

Filesystem Type Size Used Avail Use% Mounted on
/dev/sda3 ext4 916G 80G 790G 10% /
/dev/sda4 btrfs 2.8T 106G 2.7T 4% /backup

I’ve been using btrfs for ages for /backup as I use the snapshot functionality of btrfs with an hourly rsync job from my desktop to copy changes over.

Recently the fan on the NUC failed, and while overheating (I think) it appears to have written garbage in various places (this was seen on the ext4 rootfs as well as the /backup btrfs volume).

BTRFS

Trying to scrub the filesystem highlights the problems –

root@nectarine:~# btrfs scrub status /backup
UUID:             36f93b26-6187-4874-8cc6-4d4bd092e7d8
Scrub resumed:    Sat Jun 17 13:48:33 2023
Status:           finished
Duration:         1:21:28
Total to scrub:   1.23TiB
Rate:             263.66MiB/s
Error summary:    csum=60
  Corrected:      0
  Uncorrectable:  60
  Unverified:     0

(As I only have one underlying block device, it’s not possible for it to repair itself).

I now also see messages like this in ‘dmesg’ –

[ 3570.123946] BTRFS error (device sda4): unable to fixup (regular) error at logical 1870167986176 on dev /dev/sda4
[ 3570.128866] BTRFS error (device sda4): bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 199, gen 0
[ 3570.128862] BTRFS warning (device sda4): checksum error at logical 1870167683072 on dev /dev/sda4, physical 1477245284352, root 8890, inode 3750321, offset 384077824, length 4096, links 1 (path: .icedove/e1kre066.default-release-2/ImapMail/imap.gmail-2.com/INBOX-1)

Before trying to re-initialise the checksum tree (And then just let the corrupt files expire out of the filesystem with time as they get rsync’ed over) I thought I’d try :

root@nectarine:~# btrfs check -p /dev/sda4 
Opening filesystem to check...
Checking filesystem on /dev/sda4
UUID: 36f93b26-6187-4874-8cc6-4d4bd092e7d8
[1/7] checking root items                      (0:00:10 elapsed, 6406461 items checked)
Segmentation faultents                         (0:00:02 elapsed, 7542 items checked)

So that didn’t work very well.

So I thought I might as well try just re-initialising the checksum tree –

root@nectarine:~# btrfs check -p --init-csum-tree /dev/sda4 
Creating a new CRC tree
WARNING:

	Do not use --repair unless you are advised to do so by a developer
	or an experienced user, and then only after having accepted that no
	fsck can successfully repair all types of filesystem corruption. Eg.
	some software or hardware bugs can fatally damage a volume.
	The operation will start in 10 seconds.
	Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/sda4
UUID: 36f93b26-6187-4874-8cc6-4d4bd092e7d8
Reinitialize checksum tree
kernel-shared/extent_io.c:650: free_extent_buffer_internal: BUG_ON `eb->refs < 0` triggered, value 1
btrfs(+0x2b1f7)[0x5590e079d1f7]
btrfs(+0x2b381)[0x5590e079d381]
btrfs(+0x2b68e)[0x5590e079d68e]
btrfs(alloc_extent_buffer+0x77)[0x5590e079e740]
btrfs(read_tree_block+0x47)[0x5590e0796066]
btrfs(read_node_slot+0x47)[0x5590e078f7fd]
btrfs(btrfs_next_sibling_tree_block+0x95)[0x5590e0792900]
btrfs(+0x19e14)[0x5590e078be14]
btrfs(+0x1a8a8)[0x5590e078c8a8]
btrfs(iterate_extent_inodes+0x68)[0x5590e078d5dc]
btrfs(fill_csum_tree+0x46b)[0x5590e07f9440]
btrfs(+0x74bf2)[0x5590e07e6bf2]
btrfs(main+0x3d3)[0x5590e078a203]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7ff38d37fd0a]
btrfs(_start+0x2a)[0x5590e078a86a]
Aborted

So I don’t feel that worked all that well.

I guess I’ll copy off the data I don’t want to lose, and just reformat it. I was hoping the repair tools (btrfs-progs v6.2, kernel 6.1.34) had hopefully matured since I last broke a btrfs filesystem (a few years ago). I guess not?

I know btrfs is at least alerting me to issues with the data – which ext4 definitely isn’t (given /var/lib/dpkg/status contained a load of trash) – so I’ll give it credit for that. It’s just a shame the ‘repair’ tools aren’t working that well.

ext4

This isn’t written to much on this system – there’s a munin daemon running (so /var/lib/munin will have been written to) and a few log files.

Interestingly, when I first noticed a problem with the device, after logging in, I instinctively ran ‘apt-get update’ (I was hoping a reboot would fix it, at which point I might as well make sure any updates were installed).

Running ‘apt-get update’ resulted in /var/lib/dpkg/status being full of rubbish.

After the PC had been turned on for a few hours, ext4 eventually figured there were problems with it – by logging this :

[11591.230282] munin-html[22255]: segfault at a400000e ip 0000557783eaf0e9 sp 00007ffca1d969f0 error 4 in perl[557783de1000+185000] likely on CPU 3 (core 1, socket 0)
[11591.230298] Code: 4e 0c 89 56 08 83 e9 09 83 f9 01 76 14 83 fa 01 76 3f 83 ea 01 89 55 08 48 83 c4 10 5d c3 0f 1f 00 48 8b 70 08 48 85 f6 74 e3 <f6> 46 0e 10 74 dd 48 c7 40 08 00 00 00 00 8b 56 08 83 fa 01 76 22
[11591.432906] munin-graph[22257]: segfault at 55a6b77c7df0 ip 000055a64601ebc2 sp 00007ffcd88c5150 error 4 in perl[55a645fc0000+185000] likely on CPU 3 (core 1, socket 0)
[11591.432927] Code: 0f 1f 84 00 00 00 00 00 48 8b 4f 10 48 85 c9 74 5f 48 83 ec 08 48 8b 87 30 01 00 00 48 8b 50 10 48 39 d1 75 4c 48 85 f6 74 55 <48> 8b 04 f1 48 85 c0 74 20 48 8d 97 50 01 00 00 48 39 d0 74 14 8b
[12723.693630] EXT4-fs error (device sda3): htree_dirblock_to_tree:1080: inode #28706704: comm find: Directory block failed checksum
[12723.693673] Aborting journal on device sda3-8.
[12723.696920] EXT4-fs error (device sda3): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[12723.696945] EXT4-fs error (device sda3): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
[12723.708257] EXT4-fs (sda3): Remounting filesystem read-only

Rebooting and running : fsck -Cy /dev/sda3 MIGHT have fixed the rootfs.

Leave a Reply

Your email address will not be published. Required fields are marked *