Banning ‘Bad’ Robots & Crawlers

By Dale Reagan | February 4, 2012

Every now and then I see web traffic spikes due to ‘bad’ web crawlers – so what is a ‘bad bot’, ‘bad web crawler’ or ‘bad web spider’?

Virtual Rights – do we have any?

Not really – though I think we should be able to expect non-abusive access to any resource we place on the Internet.

But yes, you can establish some ‘rights’ if you dig in a bit; to some extent you can limit the types of access that you allow to your content/services.  You may also be interested in the related posts (mod_security, mod_geoip, iptables, etc.)

My definition of a ‘good’ spider/crawler:

  1. obeys robots.txt directives (a sample robots.txt follows this list)
  2. behaves ‘reasonably’ (throttles itself)
  3. provides some level of benefit to the site (e.g. a public index such as Google, Yahoo or Bing)
  4. is easily identified by its IP address as being managed by a ‘real’ search engine (i.e. the IP will show ownership by Google, Bing, etc.)
  5. the ‘user-agent’ clearly identifies the ‘bot’ and its ‘parent’, along with a contact URL
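
For reference, a minimal robots.txt sketch of the sort a well-behaved crawler would be expected to honor (the path and delay value are illustrative, not taken from my site):

# sample robots.txt – paths and values are illustrative
User-agent: *          # applies to all crawlers
Disallow: /admin/      # keep crawlers out of this path
Crawl-delay: 10        # throttle hint; honored by some crawlers, ignored by others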

My definition(s) of a ‘bad’ spider/crawler:

  1. any automated access to my web resources from which I see no benefit
  2. any automated access which ignores robots.txt directives
  3. any automated access which consumes excessive resources (i.e. non-normal, non-user type activity)
  4. any automated access which attempts to ‘hide’ its real nature
  5. any automated access which cannot be ‘entity identified’ (i.e. the IP address should belong to the spider owner)
  6. any automated access which repeats resource requests (i.e. the same URL is requested over and over)
  7. any ‘crawler for hire’ (i.e. web services that are paid to crawl your site without regard for your ‘virtual rights’…)

How can you block ‘bad’ spiders/crawlers/automated web slurpers?

An example of a ‘bad spider’ – 80legs.com (HTTP status code | # of requests):

  1. 200 | 2938 [ ~3000 successful web requests ]
  2. 404 | 3397 [ ~3400 ‘missing page’ web requests ]
  3. 400 | 342 [ ~350 ‘code 400’ web requests – errors typically seen due to ‘hacking’… ]

Reviewing the Apache logs, we see ~3000 ‘good’ web requests along with ~3400 ‘404’ (not found) errors and then ~350 ‘protocol’ type errors…
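
For this kind of review the access log needs to capture the response status and the User-Agent; the standard Apache ‘combined’ log format records both (the log file path below is an assumption, adjust it for your layout):

# 'combined' log format: records status code, referer and user agent
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog logs/access_log combined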

The ‘top’ 15 GeoIP Countries were (# of unique IPs | country):

  1.   559 | RU
  2.   257 | UA
  3.    53 | IN
  4.    34 | BY
  5.    19 | SA
  6.    17 | EG
  7.    14 | US
  8.     9 | KZ
  9.     9 | GB
  10.     7 | IQ
  11.     7 | DE
  12.     6 | PK
  13.     6 | EE
  14.     5 | IT
  15.     5 | AE

Clearly this ‘bot’ shows what I deem to be multiple ‘bad behaviors’…  More examples – excessive requests for the same resource (# of requests | URI):

  1.   670 | / # What type of BOT needs 670 requests for your home page in order to ‘get it’?
  2.   373 | /topic_URL/
  3.   326 | /terms-of-use.htm
  4.   323 | /about.htm
  5.   316 | /advertise/
  6.   308 | /privacy.htm

Blocking BAD Bots/Crawlers/Spiders

Banning the 80 Legs Spider

You can create rules in your ‘.htaccess’ files or in your Apache virtual host configurations; the idea is to find enough unique text to give you confidence that you are blocking the ‘right’ bot.  In this case there were only two unique ‘Agents’ used to generate the traffic I found:

  1. Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620
  2. Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html;) Gecko/2008032620 [ only 9 requests by this version of the agent – all of these were requests for ‘robots.txt’]

We need something to ‘match against’, and using the ‘bot name’ [which is ‘008’ in this case] may not be adequate since it could match any agent that contains the same string; using the reference URL should be fine.  The first example below matches the ‘USER_AGENT’ string against the webcrawler URL (the text between the parentheses); the second matches the simpler ‘80legs’ string…

# block user agents from site access (requires RewriteEngine On)
RewriteCond %{HTTP_USER_AGENT} (www\.80legs\.com/webcrawler\.html) [NC]
RewriteRule ^(.*)$ - [F]

or, using the simpler ‘80legs’ string (NC = case-insensitive match, F = forbidden; the request is denied):

RewriteCond %{HTTP_USER_AGENT} 80legs [NC]
RewriteRule ^(.*)$ - [F]
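
If mod_rewrite is not available, a similar user-agent block can be built with SetEnvIfNoCase and the standard Apache 2.2 access directives; a minimal sketch (the Directory path is an assumption, adjust it for your site):

# flag any request whose User-Agent mentions the 80legs crawler, then deny it
SetEnvIfNoCase User-Agent "80legs" bad_bot
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>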

Visit the Apache web site for more information on using mod_rewrite.

Banning web access using mod_GeoIP

Based on the heavy access from this bot (in this instance) I could simply ban ALL access from RU, UA and IN.  Using mod_geoip requires a bit of setup and will also require ‘maintenance’; limitations include ‘missing’ or changing IP data, so it may not catch/stop all access – but it is quick and effective when it does work.  This requires multiple settings:

  1. limit the block by ‘folder’, or use ‘/’ to block the entire site
  2. set the BlockCountry variable for each country you want to ban
  3. deny the access when that variable is set
### Ukraine
SetEnvIf GEOIP_COUNTRY_CODE UA BlockCountry
### Russian Federation
SetEnvIf GEOIP_COUNTRY_CODE RU BlockCountry
<Directory />
    Order Allow,Deny
    Allow from all
    Deny from env=BlockCountry
</Directory>
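
Note that the country rules above assume mod_geoip is already loaded and enabled; a minimal setup sketch (the module and database paths are assumptions, adjust them for your distribution):

# load mod_geoip and point it at the country database
LoadModule geoip_module modules/mod_geoip.so
<IfModule mod_geoip.c>
    GeoIPEnable On
    GeoIPDBFile /usr/share/GeoIP/GeoIP.dat
</IfModule>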

Banning web access using mod_Security

Using mod_security provides many more options – the simplest is to use the same USER_AGENT information and then block.  The advantage with mod_security is that you can create one ‘rule’ and then update a lookup file with multiple ‘bad agents’ or ‘bad IPs’ to block access.  This approach also provides a means for automating the ‘blacklist’.  Visit the related posts or search this site for more details on using mod_security (or mod_geoip).

A simple example – similar to the rewrite rule above, except that the bot is redirected to the robots.txt file:

SecRule REQUEST_HEADERS:User-Agent "80legs" "log,redirect:/robots.txt"
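
Extending that idea, a single rule can check the User-Agent against a lookup file of ‘bad agent’ strings, one phrase per line (the file name here is an assumption):

# deny any request whose User-Agent matches a phrase in the lookup file
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile bad_agents.txt" "phase:1,deny,status:403,msg:'Blacklisted user agent'"

This is the same @pmFromFile operator used for the IP blacklist below.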

A more complex, multi-file setup is required to block by IP address – you could create a rule for every IP address but it is simpler to create one rule and match banned IPs from the blacklist file.

# set IP Var below for following rule
 SecAction "phase:1,pass,nolog,setvar:tx.REMOTE_ADDR='/%{REMOTE_ADDR}/'"
 SecRule TX:REMOTE_ADDR "@pmFromFile banned.80legs.txt" "phase:1,deny,status:403,msg:'Blacklist_80_Legs_Spdr'"

In the file ‘banned.80legs.txt’ you simply list the IP address(es) or IP address patterns you wish to block, one per line; lines beginning with ‘#’ are ignored as comments.  Note that entries bounded by slashes (/) match single IP addresses; if you remove the trailing slash and the last octet of the IP, the entry matches a sub-net.

################################
# Sat Feb  4 17:17:00 EST 2012 #
################################
### 1.187.254.227    |  IN, N/A, N/A, N/A, 20.000000, 77.000000, 0, 0
/1.187.254.227/
### 2.132.10.99      |  KZ, 02, Almaty, N/A, 43.250000, 76.949997, 0, 0
/2.132.10.
### 2.135.27.17      |  KZ, 13, Kostanay, N/A, 53.166698, 63.583302, 0, 0
/2.135.27.

At this point:

  1. I am blocking ~1100 IP addresses used by the 80legs bot on the day of its ‘visit’
  2. I am blocking via the ‘USER_AGENT’ header for any new 80legs spider
  3. I scan my logs and automatically add new IP addresses to the ban list.

Note that the IP addresses used by this bot are most likely data-center hosts and/or compromised/hacked PCs.  If I saw more activity from US IP space, I would be more concerned about blocking the ‘good guys’.

As always, your mileage will vary – testing is encouraged, and make sure to keep backups of your system files.  🙂
