Bot filtering in Apache

Some people use robots.txt to filter robots, spiders, and web crawlers, allowing only a few. I prefer to filter email collectors that I know have a bad reputation. Since not every bot bothers to even check robots.txt, I use a <Limit> block to ban certain bots from deekayen.net based on their user agent string. The smart bot creators let users change the user agent string, so this method isn't foolproof, but it lets me sleep better at night.

This is what I put in .htaccess. There are probably better ways to write the regular expressions, but I haven't benchmarked whether lots of single-line regular expressions perform better in Apache than one big long expression.

SetEnvIfNoCase User-Agent "Address" banned
SetEnvIfNoCase User-Agent "^anarchie" banned
SetEnvIfNoCase User-Agent "^almaden" banned
SetEnvIfNoCase User-Agent "^CherryPicker" banned
SetEnvIfNoCase User-Agent "^Chilkat" banned
SetEnvIfNoCase User-Agent "^Clushbot" banned
SetEnvIfNoCase User-Agent "^ContactBot" banned
SetEnvIfNoCase User-Agent "^crescent" banned
SetEnvIfNoCase User-Agent "^CydralSpider" banned
SetEnvIfNoCase User-Agent "^DBrowse" banned
SetEnvIfNoCase User-Agent "^Demo" banned
SetEnvIf User-Agent "^DYNAMIC$" banned
SetEnvIfNoCase User-Agent "^EBrowse" banned
SetEnvIfNoCase User-Agent "^eCatch" banned
SetEnvIfNoCase User-Agent "^EmailCollector" banned
SetEnvIfNoCase User-Agent "^EMAILsearcher$" banned
SetEnvIfNoCase User-Agent "^EmailSiphon" banned
SetEnvIfNoCase User-Agent "^EmailWolf" banned
SetEnvIfNoCase User-Agent "^exactseek-pagereaper-" banned
SetEnvIfNoCase User-Agent "^ExtractorPro" banned
SetEnvIfNoCase User-Agent "^Franklin" banned
SetEnvIfNoCase User-Agent "^Full" banned
SetEnvIfNoCase User-Agent "^Hatena" banned
SetEnvIfNoCase User-Agent "^InfociousBot" banned
SetEnvIfNoCase User-Agent "^IUPUI" banned
SetEnvIfNoCase User-Agent "LARBIN" banned
SetEnvIfNoCase User-Agent "^Lincoln" banned
SetEnvIfNoCase User-Agent "^Missauga" banned
SetEnvIfNoCase User-Agent "^Missouri" banned
SetEnvIfNoCase User-Agent "^Miva" banned
SetEnvIfNoCase User-Agent "^NaverBot_dloader" banned
SetEnvIfNoCase User-Agent "^NetCarta_WebMapper" banned
SetEnvIfNoCase User-Agent "^Netprospector" banned
SetEnvIfNoCase User-Agent "^nicebot" banned
SetEnvIfNoCase User-Agent "^NICErsPRO" banned
SetEnvIfNoCase User-Agent "^Nudelsalat" banned
SetEnvIfNoCase User-Agent "^Nutch" banned
SetEnvIfNoCase User-Agent "OASIS" banned
SetEnvIfNoCase User-Agent "^Pajaczek" banned
SetEnvIfNoCase User-Agent "^PeerFactor" banned
SetEnvIfNoCase User-Agent "^PEval" banned
SetEnvIfNoCase User-Agent "^Port" banned
SetEnvIfNoCase User-Agent "^Production" banned
SetEnvIfNoCase User-Agent "^Program" banned
SetEnvIfNoCase User-Agent "^ProWebWalker" banned
SetEnvIfNoCase User-Agent "^Relevare" banned
SetEnvIfNoCase User-Agent "Ripper" banned
SetEnvIfNoCase User-Agent "^SeznamBot" banned
SetEnvIfNoCase User-Agent "^sna" banned
SetEnvIfNoCase User-Agent "^SpiderMan$" banned
SetEnvIfNoCase User-Agent "^SquigglebotBot" banned
SetEnvIfNoCase User-Agent "Surf" banned
SetEnvIfNoCase User-Agent "^Tarantula" banned
SetEnvIfNoCase User-Agent "^Talkro" banned
SetEnvIfNoCase User-Agent "^TheInformant" banned
SetEnvIfNoCase User-Agent "^Thunderstone" banned
SetEnvIfNoCase User-Agent "^Under" banned
SetEnvIfNoCase User-Agent "^VengaBot" banned
SetEnvIfNoCase User-Agent "^WebEMailExtrac.*" banned
SetEnvIfNoCase User-Agent "^WebEnhancer" banned
SetEnvIfNoCase User-Agent "^WebMiner" banned
SetEnvIfNoCase User-Agent "^Wells" banned
SetEnvIfNoCase User-Agent "www4mail" banned
SetEnvIfNoCase User-Agent "^yoono" banned
SetEnvIfNoCase User-Agent "^ZoomInfo" banned

<Limit GET POST HEAD>
  order allow,deny
  allow from all
  deny from env=banned
</Limit>
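For comparison, the "one big long string" alternative might look something like the sketch below. It is abbreviated and unbenchmarked: it folds only a handful of the patterns from the list into alternations, and the grouping into one anchored and one unanchored expression is an assumption about how such a rewrite could be organized, not something measured to be faster. The `^DYNAMIC$` rule stays on its own line because the original matches it case-sensitively with SetEnvIf.

```apache
# Sketch only: a few of the prefix-anchored patterns merged into one expression.
SetEnvIfNoCase User-Agent "^(anarchie|almaden|CherryPicker|Chilkat|EmailCollector|EmailSiphon|EmailWolf|ExtractorPro)" banned

# Patterns that may match anywhere in the string share a second alternation.
SetEnvIfNoCase User-Agent "(Address|LARBIN|OASIS|Ripper|Surf|www4mail)" banned

# Case-sensitive exact match, so it is kept apart from the NoCase lines.
SetEnvIf User-Agent "^DYNAMIC$" banned
```

Either form sets the same `banned` environment variable, so the <Limit> block that denies on `env=banned` would not need to change.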

