I’m hosting a couple dozen websites on a server running Rocky Linux 8. For the last two weeks or so, my monitoring server has been reporting unusual system load and activity on this server. A peek in /var/log/httpd confirms that there is suddenly a tsunami of crawler activity. My guess would be AI-related scrapers; whatever it is, the server is constantly at 99% CPU with php-fpm processes galore.
I experimented a bit, and here’s a solution based on Fail2ban that seems to work (sort of). First, I create the /etc/fail2ban/filter.d/apache-http-request.conf file like this:
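The exact contents aren’t shown above; a minimal sketch of such a filter, assuming the intent is to match every request an IP makes in the Apache access log so that the jail settings act as a rate limit (this failregex is an assumption, not the poster’s verbatim file):

    [Definition]
    # Match any request line in the access log; the jail's
    # findtime/maxretry then effectively rate-limit each client IP.
    failregex = ^<HOST> -.*"(GET|POST|HEAD)
    ignoreregex =

A matching jail section, say in /etc/fail2ban/jail.local, would then tie the filter to the httpd logs. The thresholds below are placeholders to tune, not values from the original post:

    [apache-http-request]
    enabled  = true
    port     = http,https
    filter   = apache-http-request
    logpath  = /var/log/httpd/*access_log
    maxretry = 300
    findtime = 60
    bantime  = 3600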
Hi, it seems like you are looking for a fully automated solution, but may I suggest a different approach: place some shell scripts in /etc/cron.hourly that email you hourly summaries of what your server is seeing. For example, extract and sort the top 20 IP addresses (or summarize by the first three octets, think “snowshoe”) by the number of accesses so far during the current date, or by the total bytes transferred so far during the current date. You can then use firewalld to block the associated CIDR address block whenever the activity is malicious or excessive; after blocking a few dozen CIDR address blocks, you just might find a lot less load on your server. And if you need help with creating those scripts, you can feed examples of the log lines into Microsoft Copilot and ask it to create sample scripts for you, which you can then tweak; a rough starting point is sketched below.
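A rough sketch of such a cron script, assuming the default combined log format, IPv4 clients, and a single access log at /var/log/httpd/access_log (the log path and recipient address are placeholders):

    #!/bin/bash
    # Hypothetical /etc/cron.hourly/crawler-report: summarize today's
    # Apache traffic by client IP and by /24, then mail the result.
    LOG=/var/log/httpd/access_log     # placeholder; adjust per vhost layout
    TODAY=$(date +%d/%b/%Y)           # Apache's default timestamp format

    {
      echo "Top 20 IPs by request count on $TODAY:"
      grep "$TODAY" "$LOG" | awk '{print $1}' \
        | sort | uniq -c | sort -rn | head -20

      echo
      echo "Top 20 /24 blocks (first three octets) by request count:"
      grep "$TODAY" "$LOG" \
        | awk '{split($1, o, "."); print o[1] "." o[2] "." o[3] ".0/24"}' \
        | sort | uniq -c | sort -rn | head -20

      echo
      echo "Top 20 IPs by bytes transferred (field 10 in combined format):"
      grep "$TODAY" "$LOG" \
        | awk '{bytes[$1] += $10} END {for (ip in bytes) print bytes[ip], ip}' \
        | sort -rn | head -20
    } | mail -s "Hourly crawler summary: $(hostname)" you@example.com

Blocking a misbehaving /24 with firewalld then looks like this (203.0.113.0/24 is just the documentation example range):

    firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="203.0.113.0/24" reject'
    firewall-cmd --reload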
I fiddled around with Fail2ban and found a solution that works perfectly: it keeps the most obnoxious crawlers out without banning search engine crawlers.
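One way to spare the big search engines (an assumption about how this could be done, not the poster’s actual config) is an ignoreregex in the filter that skips log lines whose User-Agent identifies a well-known bot:

    # In /etc/fail2ban/filter.d/apache-http-request.conf:
    # lines matching ignoreregex are never counted toward a ban.
    ignoreregex = (?i)(googlebot|bingbot|duckduckbot|yandexbot)

Keep in mind that User-Agent strings are trivially spoofed, so for stricter matching you could verify claimed search engine bots by reverse DNS instead.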