A customer has a relatively busy web site, which contains lots of juicy information (business names, addresses, email addresses, phone numbers etc.). Currently there is nothing in place to stop people spidering it – unless someone explicitly looks at the log files and does something about it.
Blocking annoying people who spider the site is easy enough –
iptables -I INPUT -s 80.x.x.x -j REJECT
However, I’d obviously rather automate this if possible – and ideally without having to change the PHP code (as each request would then need to perform some sort of DB lookup to decide whether it’s part of a spidering attempt).
So, my first idea was to adapt an existing rule I have to limit SSH connection attempts, giving something like:
iptables -I INPUT -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --set
iptables -I INPUT -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --update --seconds 60 --hitcount 40 -j LOG --log-prefix "http spidering?" --log-ip-options --log-tcp-options --log-tcp-sequence --log-level 4
Annoyingly, however, even though these are the first rules in the iptables output – and should therefore be hit – they don’t work; i.e. I’m not seeing anything being logged when doing e.g. the following on a remote server:
while true ; do
wget -q -O - http://server.xyz/index.php
done
So, I’m still trying to avoid making changes to the code base – although doing so would produce the best user experience (namely, we could display a captcha or something, and anyone who really can browse that quickly wouldn’t encounter any problems).
I’ve just found mod_evasive, which claims to provide DoS and DDoS protection. Thankfully Jason Litka has packaged it – so I have no problems from an installation point of view 🙂 (yum install mod_evasive)
Installation on Debian doesn’t result in a config file – but it’s not difficult to create one (see /usr/share/doc/mod_evasive); something along the lines of the snippet below. However, it’s not a shiny, sunny ending – mod_evasive appears to be “tripped” by people requesting images, and in my case the client has about 10-20 images per page, so it’s difficult to differentiate between a normal user loading a page and someone running httrack against the site only requesting the “PHP pages”. If only mod_evasive took a regexp to ignore/match… I can’t seem to find any way of fixing this.
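For reference, a config along these lines (directive names are from the bundled docs; the numbers are purely illustrative and would need tuning):

<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    # requests for the same URI per client within DOSPageInterval seconds
    DOSPageCount        5
    DOSPageInterval     1
    # requests for any object on the site per client within DOSSiteInterval seconds
    # (images count towards this, hence the problem described above)
    DOSSiteCount        100
    DOSSiteInterval     1
    # how long (seconds) a tripped client keeps getting 403s
    DOSBlockingPeriod   60
    DOSLogDir           /var/log/mod_evasive
    #DOSEmailNotify     someone@example.com
    #DOSWhitelist       127.0.0.1
</IfModule>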
So application logic it is :-/ Perhaps caching in APC may be the way forward ….
Well, Checkpoint has a lot of this functionality built in, although it can be a pain to get it working just the way you want, as some of the application security features are a little buggy (or just plain don’t work). For what you want it’s pretty solid, though. It is proprietary and damn expensive, so I doubt it’s your sort of thing 😉
Dish up the images from another subdomain, for example images.client.com – they are static content, and you should then be able to build mod_evasive settings so that they only kick in on requests to the main application.
This sort of architecture would help you anyway in future, when you look at putting relatively static assets like images and JavaScript into a CDN or something?
Finally, keeping them on a different VirtualHost (or subdomain) would allow a fairly simple load balancer to distinguish between them: thus you could build $solution to employ 5-6 beefy ‘application’ servers and just a couple of other boxes for ‘static’ content, and potentially deploy different caching, software (Apache vs. nginx) and tuning to make it all funky and fast.
Then you can apply mod_evasive settings to just the application side – see the sketch below.
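As a rough sketch (hostnames and paths here are made up, and in practice the application and static vhosts would probably sit on separate boxes or instances, with mod_evasive only on the application side):

NameVirtualHost *:80

# the PHP application – this is the side any rate limiting applies to
<VirtualHost *:80>
    ServerName   www.client.com
    DocumentRoot /var/www/app
</VirtualHost>

# static assets only (images/JS/CSS) – no rate limiting, different caching/tuning
<VirtualHost *:80>
    ServerName   images.client.com
    DocumentRoot /var/www/static
</VirtualHost>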
Obviously – but there’s still the problem that some pages need rate limiting and others don’t. I don’t care if someone requests the home page a zillion times in one minute – it’s effectively static (well, PHP, but no DB calls). I do care if someone starts going through each business one after the other leeching their details.
Yes – I could split the images off onto another domain – unfortunately the code base is horrible, and it would have to be enforced via e.g. mod_rewrite. I don’t think we’re at the point of needing to do this yet.
I’ve written a PHP solution, which I’ll post here soon. It at least allows for a friendly ‘error’ page and a captcha to fill in, which makes it more user-friendly and less likely to fsck up if there are a lot of users behind one proxy (for example).
Did you ever do a write-up of your PHP solution? I’m having a similar issue with scrapers crawling my site too quickly. I’m looking at mod_evasive, but have the same concerns that you do. Thanks!
The hitcount can’t be higher than 20 using the default iptables settings – if you set it higher, the rule never gets triggered.
Also, I think you want to reverse the order of the rules; otherwise the set gets triggered, but the update doesn’t.
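For anyone following along later, an untested sketch that folds both of those points in – a hitcount within the module’s default ip_pkt_list_tot of 20, explicit positions so the --set rule sits ahead of the --update rule, and a named list so it doesn’t share state with the SSH rule:

# record every new connection to port 80 in a recent list called HTTP
iptables -I INPUT 1 -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --name HTTP --set
# log once an address has made 20+ new connections within 60 seconds
iptables -I INPUT 2 -p tcp --dport 80 -i eth0 -m state --state NEW -m recent --name HTTP --update --seconds 60 --hitcount 20 -j LOG --log-prefix "http spidering?" --log-level 4

If 20 hits a minute is too low, the limit itself can be raised by loading the ipt_recent/xt_recent module with a larger ip_pkt_list_tot, rather than just bumping --hitcount.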