Minimal WordPress Fail2ban integration

I used to have a fail2ban filter set up to look for POST requests to wp-login.php, but the size of the Apache log files on one server made this infeasible (it took fail2ban too long to parse/process them). Also, filtering the Apache log for POST /wp-login … catches successful logins as well as failures.

Perhaps this is a better approach :

Assumptions

  • You’re using PHP configured with an error_log = /var/log/php.log

    • If this isn’t configured, PHP will probably log to the webserver’s error log file (/var/log/apache2/error.log perhaps).

  • The Apache/PHP processes are able to write to the error_log file.
  • You’re using Debian or Ubuntu Linux

Add a ‘must use’ WordPress plugin

Put this in … /path/to/your/site/wp-content/mu-plugins/log-auth-failures.php

(It must be wp-content/mu-plugins … )

<?php
add_action( 'wp_login_failed', 'login_failed' );
function login_failed( $username ) {
    error_log( "WORDPRESS LOGIN FAILURE {$_SERVER['REMOTE_ADDR']} - user $username from " . __FILE__ );
}

(You obviously don’t have to use error_log – you could log somewhere else entirely – and there’s a good argument for not logging $username at all, as it’s ultimately user-supplied data that could mess things up.)

Fail2ban config

Then in /etc/fail2ban/jail.d/wordpress-login.conf :

[wordpress-login]
enabled = true
filter = wordpress-login
action = iptables-multiport[name=wp, port="80,443", protocol=tcp]
logpath = /var/log/php.log
maxretry = 5

If you have PHP logging somewhere else, change the logpath appropriately.

Finally in /etc/fail2ban/filter.d/wordpress-login.conf put :

[Definition]

# PHP error_log is logging to /var/log/php.log something like :
#[31-Jan-2024 20:34:10 UTC] WORDPRESS LOGIN FAILURE 1.2.3.4 - user admin.person from /var/www/vhosts/mysite/httpdocs/wp-content/mu-plugins/log-auth-failures.php

failregex = WORDPRESS LOGIN FAILURE <HOST> - user 


ignoreregex =

Extra bonus points for making the failregex stricter, or for not including $username in the log output at all (which perhaps leaves it open to some sort of log-injection attack).

There’s probably a good argument for using a separate file (not the PHP error_log) so other random error messages can’t confuse fail2ban – which would also let you use a date format that fail2ban is happier with, etc.

Finally …

Restart fail2ban and watch its log file (and /var/log/php.log).

service fail2ban restart
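
fail2ban-regex is handy for sanity-checking that the filter actually matches what PHP is logging, and fail2ban-client will show the jail once it’s running, e.g. :

fail2ban-regex /var/log/php.log /etc/fail2ban/filter.d/wordpress-login.conf
fail2ban-client status wordpress-login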

Excessive uptime(!?)

Somewhere on the internet there’s a mailserver with a larger uptime, I guess?

[root@xxxxxxxx ~]# uname -a
Linux xxxxxxxxxxxxxxx 2.6.18-419.el5 #1 SMP Fri Feb 24 22:47:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@xxxxxxxx ~]# uptime
09:34:38 up 2290 days,  1:47,  ....

I don’t think anyone dares to reboot it …. (this is a server the customer was going to migrate off about 5 years ago …. somehow it’s still in use)

(2290 days is a little over 6 years)

btrfs & ext4 – error handling when the hardware fails …

I have a mini PC (an old Intel NUC) that I use for taking backups of my desktop. It has a single 4TiB SSD in it.

Filesystem Type Size Used Avail Use% Mounted on
/dev/sda3 ext4 916G 80G 790G 10% /
/dev/sda4 btrfs 2.8T 106G 2.7T 4% /backup

I’ve been using btrfs for ages for /backup as I use the snapshot functionality of btrfs with an hourly rsync job from my desktop to copy changes over.
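
The hourly job is roughly along these lines (a sketch – the hostname and paths here are invented, and /backup/current would need to be a btrfs subvolume for the snapshot step to work) :

#!/bin/bash
# Pull changes from the desktop, then take a read-only, timestamped snapshot.
rsync -aH --delete desktop:/home/ /backup/current/
btrfs subvolume snapshot -r /backup/current "/backup/snapshots/$(date +%Y-%m-%d_%H%M)"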

Recently the fan on the NUC failed, and while overheating (I think) it appears to have written garbage in various places (this was seen on the ext4 rootfs as well as the /backup btrfs volume).

BTRFS

Trying to scrub the filesystem highlights the problems –

root@nectarine:~# btrfs scrub status /backup
UUID:             36f93b26-6187-4874-8cc6-4d4bd092e7d8
Scrub resumed:    Sat Jun 17 13:48:33 2023
Status:           finished
Duration:         1:21:28
Total to scrub:   1.23TiB
Rate:             263.66MiB/s
Error summary:    csum=60
  Corrected:      0
  Uncorrectable:  60
  Unverified:     0

(As I only have one underlying block device, btrfs has no second copy of the data to repair from.)
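
(For anyone following along, a scrub is kicked off with something like the command below; it runs in the background and ‘btrfs scrub status’ reports on it afterwards.)

btrfs scrub start /backup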

I now also see messages like this in ‘dmesg’ –

[ 3570.123946] BTRFS error (device sda4): unable to fixup (regular) error at logical 1870167986176 on dev /dev/sda4
[ 3570.128866] BTRFS error (device sda4): bdev /dev/sda4 errs: wr 0, rd 0, flush 0, corrupt 199, gen 0
[ 3570.128862] BTRFS warning (device sda4): checksum error at logical 1870167683072 on dev /dev/sda4, physical 1477245284352, root 8890, inode 3750321, offset 384077824, length 4096, links 1 (path: .icedove/e1kre066.default-release-2/ImapMail/imap.gmail-2.com/INBOX-1)

Before re-initialising the checksum tree (and then just letting the corrupt files age out of the filesystem as they get rsync’ed over again) I thought I’d try :

root@nectarine:~# btrfs check -p /dev/sda4 
Opening filesystem to check...
Checking filesystem on /dev/sda4
UUID: 36f93b26-6187-4874-8cc6-4d4bd092e7d8
[1/7] checking root items                      (0:00:10 elapsed, 6406461 items checked)
Segmentation faultents                         (0:00:02 elapsed, 7542 items checked)

So that didn’t work very well.

So I thought I might as well try just re-initialising the checksum tree –

root@nectarine:~# btrfs check -p --init-csum-tree /dev/sda4 
Creating a new CRC tree
WARNING:

	Do not use --repair unless you are advised to do so by a developer
	or an experienced user, and then only after having accepted that no
	fsck can successfully repair all types of filesystem corruption. Eg.
	some software or hardware bugs can fatally damage a volume.
	The operation will start in 10 seconds.
	Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/sda4
UUID: 36f93b26-6187-4874-8cc6-4d4bd092e7d8
Reinitialize checksum tree
kernel-shared/extent_io.c:650: free_extent_buffer_internal: BUG_ON `eb->refs < 0` triggered, value 1
btrfs(+0x2b1f7)[0x5590e079d1f7]
btrfs(+0x2b381)[0x5590e079d381]
btrfs(+0x2b68e)[0x5590e079d68e]
btrfs(alloc_extent_buffer+0x77)[0x5590e079e740]
btrfs(read_tree_block+0x47)[0x5590e0796066]
btrfs(read_node_slot+0x47)[0x5590e078f7fd]
btrfs(btrfs_next_sibling_tree_block+0x95)[0x5590e0792900]
btrfs(+0x19e14)[0x5590e078be14]
btrfs(+0x1a8a8)[0x5590e078c8a8]
btrfs(iterate_extent_inodes+0x68)[0x5590e078d5dc]
btrfs(fill_csum_tree+0x46b)[0x5590e07f9440]
btrfs(+0x74bf2)[0x5590e07e6bf2]
btrfs(main+0x3d3)[0x5590e078a203]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7ff38d37fd0a]
btrfs(_start+0x2a)[0x5590e078a86a]
Aborted

So I don’t feel that worked all that well.

I guess I’ll copy off the data I don’t want to lose, and just reformat it. I was hoping the repair tools (btrfs-progs v6.2, kernel 6.1.34) had matured since I last broke a btrfs filesystem (a few years ago). I guess not?

I know btrfs is at least alerting me to issues with the data – which ext4 definitely isn’t (given /var/lib/dpkg/status contained a load of trash) – so I’ll give it credit for that. It’s just a shame the ‘repair’ tools aren’t working that well.

ext4

The ext4 root filesystem isn’t written to much on this system – there’s a munin daemon running (so /var/lib/munin will have been written to) and a few log files.

Interestingly, when I first noticed a problem with the device, I instinctively ran ‘apt-get update’ after logging in (I was hoping a reboot would fix it, at which point I might as well make sure any updates were installed).

Running ‘apt-get update’ resulted in /var/lib/dpkg/status being full of rubbish.

After the PC had been on for a few hours, ext4 eventually worked out there were problems – by logging this :

[11591.230282] munin-html[22255]: segfault at a400000e ip 0000557783eaf0e9 sp 00007ffca1d969f0 error 4 in perl[557783de1000+185000] likely on CPU 3 (core 1, socket 0)
[11591.230298] Code: 4e 0c 89 56 08 83 e9 09 83 f9 01 76 14 83 fa 01 76 3f 83 ea 01 89 55 08 48 83 c4 10 5d c3 0f 1f 00 48 8b 70 08 48 85 f6 74 e3 <f6> 46 0e 10 74 dd 48 c7 40 08 00 00 00 00 8b 56 08 83 fa 01 76 22
[11591.432906] munin-graph[22257]: segfault at 55a6b77c7df0 ip 000055a64601ebc2 sp 00007ffcd88c5150 error 4 in perl[55a645fc0000+185000] likely on CPU 3 (core 1, socket 0)
[11591.432927] Code: 0f 1f 84 00 00 00 00 00 48 8b 4f 10 48 85 c9 74 5f 48 83 ec 08 48 8b 87 30 01 00 00 48 8b 50 10 48 39 d1 75 4c 48 85 f6 74 55 <48> 8b 04 f1 48 85 c0 74 20 48 8d 97 50 01 00 00 48 39 d0 74 14 8b
[12723.693630] EXT4-fs error (device sda3): htree_dirblock_to_tree:1080: inode #28706704: comm find: Directory block failed checksum
[12723.693673] Aborting journal on device sda3-8.
[12723.696920] EXT4-fs error (device sda3): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[12723.696945] EXT4-fs error (device sda3): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
[12723.708257] EXT4-fs (sda3): Remounting filesystem read-only

Rebooting and running : fsck -Cy /dev/sda3 MIGHT have fixed the rootfs.

systemd-resolve (DNS is always to blame)

For the record, this is using systemd v247, from Debian’s buster-backports.

I think I was enticed by the Kool-Aid, hoping to be able to have DNSSEC or DNS-over-TLS …. and caching … and to be fair, it appeared to work on all the servers I’d installed it on (although they were just ‘boring’ LAMP-style webservers).

Anyway, everything seemed to be going well, with the default /etc/resolv.conf like :

nameserver 127.0.0.53

options edns0

and /etc/systemd/resolved.conf looking like :

[Resolve]
DNS=8.8.8.8#dns.google 8.8.4.4#dns.google 1.1.1.1
FallbackDNS=1.1.1.1 8.8.4.4 9.9.9.9
LLMNR=no
DNSOverTLS=opportunistic
DNSSEC=no
Cache=yes

Unfortunately, on one relatively busy server which makes multiple outbound HTTP requests every second, I saw sporadic failures where curl would report a timeout for e.g. graph.facebook.com (>10s connect time).

The timeouts seemed to be grouped together (no timeouts for a number of hours, and then a load of requests would fail) and, to be extra annoying, this only happened in production and wasn’t something I could reproduce.

As best I can tell, a failed lookup was being cached, so all requests for that hostname would then fail until the cache entry expired (30 seconds?).
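
(When it happens, the cache can at least be flushed by hand while you investigate :)

resolvectl flush-caches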

So I end up having /etc/resolv.conf looking a bit more like a traditional one with 8.8.8.8 as the first nameserver and some custom options to lower the retry time and hopefully trigger multiple DNS lookup attempts.
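
Roughly like this (the second nameserver and the exact timeout/attempts values here are illustrative) :

nameserver 8.8.8.8
nameserver 1.1.1.1
options edns0 timeout:2 attempts:3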

So, perhaps …. perhaps … systemd-resolve isn’t quite ready for production yet?

postfix / postscreen and dns blacklist fun

The other day I decided to stop using my hacky Perl script for Postfix policyd duties (it’s ages since I wrote any Perl) and use postscreen instead.

Postscreen setup was fairly easy – there’s a load of config below.

Gotchas – Spamhaus doesn’t like queries that reach it via a public resolver (e.g. 8.8.8.8); in that case it returns special error codes rather than a normal listing, so you need to pin the match to =127.0.0.[2..11] so those error responses aren’t counted as hits.

It also logs quite a lot.

Current Postfix postscreen main.cf config :

postscreen_access_list = permit_mynetworks, cidr:/etc/postfix/postscreen_access.cidr
postscreen_dnsbl_threshold = 2
postscreen_dnsbl_sites = zen.spamhaus.org=127.0.0.[2..11]*2
       bl.spamcop.net*1 
       b.barracudacentral.org=127.0.0.2*1
       bl.spameatingmonkey.net*1
       bl.mailspike.net*1
       tor.ahnl.org*1
       dnsbl.justspam.org=127.0.0.2*1
       bip.virusfree.cz*1
       spam.dnsbl.sorbs.net=127.0.0.6*1

postscreen_greet_action = enforce
postscreen_greet_wait = 5s
postscreen_greet_ttl = 2d

postscreen_blacklist_action = drop
postscreen_dnsbl_ttl = 2h
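
(For completeness – postscreen also needs enabling in master.cf if it isn’t already. The lines below are roughly the stock example from Postfix’s POSTSCREEN_README; the chroot column may differ on your system. Follow it with a ‘postfix reload’.)

smtp      inet  n       -       n       -       1       postscreen
smtpd     pass  -       -       n       -       -       smtpd
dnsblog   unix  -       -       n       -       0       dnsblog
tlsproxy  unix  -       -       n       -       0       tlsproxy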

SMTP Auth whitelisting …

My server allows people to send outbound mail (authenticated) on port 25, but postscreen isn’t aware of successful authentication when it runs, so such people may be blocked simply because their IP is on a DNS blacklist. They therefore need explicit whitelisting, via a dovecot postlogin script (example below). If you use that, the postscreen_access_list needs to change to something like :

postscreen_access_list = permit_mynetworks
	cidr:/etc/postfix/postscreen_access.cidr
	mysql:/etc/postfix/mysql/check_mail_log.cf

and /etc/postfix/mysql/check_mail_log.cf looks like :

user = mail_log
password = something
hosts = 127.0.0.1
dbname = mail_log
query = SELECT 'permit' FROM mail_log WHERE ip_address = '%s' UNION SELECT 'dunno' LIMIT 1 ;

The dovecot config change(s) are – in /etc/dovecot/dovecot.conf

....
service pop3 {
	executable = pop3 postlogin
}
service imap {
	executable = imap postlogin
}

service postlogin {
	executable = script-login /etc/dovecot/postlogin.sh
	user = $default_internal_user
	unix_listener postlogin {
	}
}
....

and /etc/dovecot/postlogin.sh looks a bit like :

#!/bin/bash

# $IP and $USER are set in the environment by dovecot's script-login.
# Skip local connections (e.g. webmail on the same box).
if [ "x${IP}" != "x" ]; then
	if [ ! "$IP" = "127.0.0.1" ]; then
		echo "INSERT INTO mail_log (username, ip_address) VALUES ('$USER', '$IP')" | mysql --defaults-extra-file=/etc/dovecot/mysql.cnf mail_log
	fi
fi

exec "$@"

exit 0

The /etc/dovecot/postlogin.sh will need to be executable.
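
e.g. :

chmod +x /etc/dovecot/postlogin.sh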

/etc/dovecot/mysql.cnf just looks like a normal MySQL cnf file –

[client]
user = mail_log
password = something
database = mail_log

The SQL schema for the mail_log table is :

CREATE TABLE `mail_log` (
  `username` varchar(255) NOT NULL,
  `ip_address` varchar(255) NOT NULL,
  `dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY `mlip` (`ip_address`(191)),
  KEY `dt_idx` (`dt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Ideally I suppose you’d add a cron job to prune entries in mail_log older than a set time, and probably have a unique key on username with some sort of “INSERT INTO x ON DUPLICATE … ” change to the postlogin.sh script above.
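
Something like this, perhaps (untested – the 30 day retention and the unique key are assumptions on my part) :

-- run from cron, e.g. daily: prune old entries
DELETE FROM mail_log WHERE dt < NOW() - INTERVAL 30 DAY;

-- with a UNIQUE KEY over (username, ip_address) on the table, postlogin.sh could instead do :
INSERT INTO mail_log (username, ip_address) VALUES ('$USER', '$IP')
  ON DUPLICATE KEY UPDATE dt = CURRENT_TIMESTAMP;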

Packer and Azure

I needed to build some Virtual Machine images (using packer) for work the other day.

I already had a Packer configuration set up (but for AWS), and when trying to add support for an ‘azure-arm‘ builder I kept getting the following error message in my web browser as I attempted to authenticate Packer with Azure :

“AADSTS650052: The app needs to access to a service (https://vault.azure.net) that your organization \”<random-id>\” has not subscribed or enabled. Contact your IT Admin to review the configuration of your service subscriptions.”

This isn’t the most helpful of error messages, when I’m probably meant to be the “IT Admin”.

After eventually giving in (as I couldn’t find any similar reports of this problem) and reaching out to our contact at Microsoft, it turned out we needed to enable some additional Resource Providers on the Subscription …. and of course the name has to be slightly different 😉 (Microsoft.KeyVault). Oh well ….
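
If you’d rather do that from the CLI than the portal, registering the provider is something like this (assuming the Azure CLI is installed and logged in to the right subscription) :

az provider register --namespace Microsoft.KeyVault
az provider show --namespace Microsoft.KeyVault --query registrationState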

Having done this, Packer does now work (Hurrah!)

Hopefully this will help someone else in the future.