A customer has a relatively busy web site, which contains lots of juicy information (business names, addresses, email addresses, phone numbers, etc.). Currently there is nothing in place to stop people spidering it, unless someone explicitly looks at the log files and does something.
Blocking annoying people who spider the site is easy enough –
iptables -I INPUT -s 80.x.x.x -j REJECT
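The "look at the log files" step can at least be semi-automated. A minimal sketch, assuming an Apache access log in the common/combined format (client IP in the first field); the function name, the threshold and the log path are mine, not anything already on the server:

```shell
#!/bin/sh
# Hypothetical helper: read an access log on stdin and list every client IP
# with more than $1 requests, busiest first, as "count ip".
# Assumes the client IP is the first whitespace-separated field.
busy_ips() {
    threshold=$1
    awk -v t="$threshold" '
        { hits[$1]++ }
        END { for (ip in hits) if (hits[ip] > t) print hits[ip], ip }
    ' | sort -rn
}

# Example use (prints the REJECT rules rather than running them; the log
# path is an assumption and differs between distributions):
# busy_ips 1000 < /var/log/httpd/access_log |
#     while read -r count ip; do
#         echo "iptables -I INPUT -s $ip -j REJECT"
#     done
```

This still counts image requests along with page requests, and it only looks at whatever is in the log at the time, so it is a convenience for the manual approach rather than real automation.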
However, I’d obviously rather automate this if possible, and ideally without having to change the PHP code (as each request would then need to perform some sort of DB lookup to check whether it’s part of a spidering attempt).
So, my first idea was to manipulate an existing rule I have to limit SSH connection attempts, giving something like :
iptables -I INPUT -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --set
iptables -I INPUT -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --update --seconds 60 --hitcount 40 -j LOG --log-prefix "http spidering?" --log-ip-options --log-tcp-options --log-tcp-sequence --log-level 4
Annoyingly, even though these are the first rules in the iptables output (and should therefore work), they don’t: I’m not seeing anything logged when running, for example, the following on a remote server :
while true ; do
    wget -q -O - http://server.xyz/index.php
done
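One thing worth checking, which I haven't verified on the box in question: the recent match only remembers a limited number of packets per address (the ip_pkt_list_tot module parameter, which as far as I know defaults to 20 on kernels of this era), so a --hitcount of 40 may simply never match. A rough sketch, assuming that is the cause, which prints the same pair of rules with the hit count capped at a limit; the function name and the fallback of 20 are my own assumptions:

```shell
#!/bin/sh
# Hypothetical helper: print (not apply) the two rules, capping --hitcount.
#   $1 = wanted hit count, $2 = window in seconds,
#   $3 = per-address packet limit of the recent module (assumed 20 if omitted)
# The real limit can usually be read from
#   /sys/module/xt_recent/parameters/ip_pkt_list_tot
# (the module is called ipt_recent on older kernels).
http_recent_rules() {
    want=$1
    secs=$2
    limit=${3:-20}
    hits=$want
    if [ "$hits" -gt "$limit" ]; then
        hits=$limit
    fi
    # Explicit positions keep the --set rule ahead of the --update rule.
    printf 'iptables -I INPUT 1 -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --set\n'
    printf 'iptables -I INPUT 2 -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --update --seconds %s --hitcount %s -j LOG --log-prefix "http spidering? " --log-ip-options --log-tcp-options --log-tcp-sequence --log-level 4\n' "$secs" "$hits"
}
```

Running `http_recent_rules 40 60` prints the rules with --hitcount 20. Note that capping the count without shortening the window halves the request rate that triggers the log line, so 20 hits in 30 seconds would be closer to the original intent.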
So, I’m still trying to avoid making changes to the code base – although doing so would produce the best user experience (namely we could display a captcha or something and if someone really can browse that quickly they’d not encounter any problems).
I’ve just found mod_evasive, which claims to provide DoS and DDoS protection. Thankfully Jason Litka has packaged it, so I have no problems from an installation point of view 🙂 (yum install mod_evasive)
Installation on Debian doesn’t result in a config file, but it’s not difficult to create one (see /usr/share/doc/mod_evasive). However, it’s not a shiny, sunny ending: mod_evasive appears to be “tripped” by people requesting images, and in my case the client has about 10-20 images per page, so it’s difficult to differentiate between a normal user loading a page and someone running httrack on the website and only requesting the “php page”. If only mod_evasive took a regexp to ignore/match… I can’t seem to find any way of fixing this.
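For reference, the config file in question looks roughly like the following. This is a sketch only: the directive names are the ones from the mod_evasive README, but the thresholds are illustrative guesses, the helper name is made up, and I have not found values that cleanly separate a normal image-heavy page load from a spider.

```shell
#!/bin/sh
# Hypothetical helper: write a starting mod_evasive config to the path in $1.
# All numbers below are assumptions to be tuned, not recommended values.
write_evasive_conf() {
    cat > "$1" <<'EOF'
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    # Block a client that requests the same URI more than this many
    # times within DOSPageInterval seconds
    DOSPageCount        5
    DOSPageInterval     1
    # Block a client that makes more than this many requests for anything
    # on the site within DOSSiteInterval seconds (set high here to leave
    # room for 10-20 images per page)
    DOSSiteCount        100
    DOSSiteInterval     1
    # How long, in seconds, a blocked client keeps getting 403s
    DOSBlockingPeriod   10
</IfModule>
EOF
}
```

Something like `write_evasive_conf /etc/apache2/conf.d/mod_evasive.conf` followed by an Apache reload would put it in place; that path is an assumption and depends on how the distribution lays out its Apache config.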
So application logic it is :-/ Perhaps caching in APC is the way forward…