robot.txt
Hi,
Some of my forums are being harvested by search engine bots.
What is the best robots.txt configuration to block ALL search engines from those forums only?
Thanks in advance.
BornOnline
08-25-2005, 04:31 PM
Depending on the search engine, it may not even look at your robots.txt.
Validator (http://www.searchengineworld.com/cgi-bin/robotcheck.cgi)
User-agent: *
Disallow: /forum/
The following allows all robots to visit all files: the wildcard "*" matches every robot, and the empty Disallow line blocks nothing.
User-agent: *
Disallow:
This one keeps all robots out.
User-agent: *
Disallow: /
The next one bars all robots from the cgi-bin and images directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
This one bans BadSearch from all files on the server:
User-agent: BadSearch
Disallow: /
This one keeps googlebot from getting at the whatever.htm file:
User-agent: googlebot
Disallow: /whatever.htm
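If you want to check rules like these before a crawler does, here is a rough sketch using Python's standard urllib.robotparser module. The helper name allowed, the bot name AnyBot and the example.com URLs are made up for illustration, and a real crawler may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rule sets copied from the examples in this post.
FORUM_ONLY = """\
User-agent: *
Disallow: /forum/
"""

BAN_ONE_BOT = """\
User-agent: BadSearch
Disallow: /
"""

# Hypothetical URLs; only the path part matters for the match.
print(allowed(FORUM_ONLY, "AnyBot", "http://example.com/forum/thread1.html"))  # False
print(allowed(FORUM_ONLY, "AnyBot", "http://example.com/index.html"))          # True
print(allowed(BAN_ONE_BOT, "BadSearch", "http://example.com/index.html"))      # False
print(allowed(BAN_ONE_BOT, "OtherBot", "http://example.com/index.html"))       # True
```

This only tells you what a well-behaved robot should do with the file. It says nothing about bots that never read robots.txt.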
You can't block all bots: some of them don't respect the robots.txt standard.
If too many bots ignore your robots.txt, you can use .htaccess to block them. It's pretty easy.
Something like this:
SetEnvIfNoCase User-Agent "^The_super_bot" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
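One thing to watch in the SetEnvIfNoCase line is the "^" anchor: the pattern only matches when the User-Agent string starts with that name. A small Python sketch of the same match follows. It approximates Apache's case-insensitive regex with re.IGNORECASE, and the sample user-agent strings are invented for the example.

```python
import re

# Pattern copied from the SetEnvIfNoCase line above; the "NoCase" part is
# approximated with re.IGNORECASE. This is an illustration, not Apache itself.
BAD_BOT = re.compile(r"^The_super_bot", re.IGNORECASE)


def is_bad_bot(user_agent: str) -> bool:
    """Return True if the user agent would be tagged bad_bot by the pattern."""
    return BAD_BOT.search(user_agent) is not None


# Hypothetical user-agent strings.
samples = [
    "The_super_bot/1.0",
    "the_super_bot (crawler)",
    "Mozilla/5.0 (compatible; The_super_bot/1.0)",
]
for ua in samples:
    print(ua, "->", "denied" if is_bad_bot(ua) else "allowed")
```

The third string is allowed because the bot name is not at the start. If the bot you want to block identifies itself in the "Mozilla/5.0 (compatible; ...)" style, drop the "^" so the name can match anywhere in the string.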
Thank you, I'll try it and report
Hey, is there any way to test whether the .htaccess method works, instead of waiting for the next crawl?
Well... probably by using a browser that lets you change your user agent.
If you use Firefox, I think there's an extension that can do it.
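Another option that needs no browser at all is to send a request with a fake User-Agent header from a script. Below is a minimal sketch with Python's standard urllib. The function names build_request and status_for are my own, and the URL in the usage note is a placeholder for your own forum address.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url: str, user_agent: str) -> int:
    """Send the request and return the HTTP status code the server answers with."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx answers; a denied bot should end up here.
        return error.code
```

Calling status_for("http://www.example.com/forum/", "The_super_bot/1.0") should give 403 if the deny rule is working, while the same call with a normal browser user-agent string should still give 200. The exact codes depend on your server setup.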
I'm watching them crawl me now, heh
tcpdump -i venet0 port 80
I'm not sure whether I'm being slashdotted or not,
but see this:
17930 23.03% Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com
2744 3.52% Googlebot/2.1 (+http://www.google.com/bot.html)
I think that if you were slashdotted, you would notice far more than bots ;)
Check the raw logs for a more thorough investigation.
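For going through the raw logs, here is a rough sketch that counts requests per user agent. It assumes the access log is in Apache's combined format, where the user agent is the last quoted field on each line. The function names and the log path in the usage note are assumptions, so adjust them to your server.

```python
import re
from collections import Counter

# In the combined log format the user agent is the last quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count how many log lines each user agent accounts for."""
    counts = Counter()
    for line in lines:
        match = LAST_QUOTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(counts):
    """Return (hits, percent, user_agent) tuples, busiest user agent first."""
    total = sum(counts.values())
    return [(n, round(100.0 * n / total, 2), ua) for ua, n in counts.most_common()]
```

Usage would be something like opening your access log (for example a hypothetical /var/log/apache2/access.log), passing the file object to count_user_agents, and printing each tuple from report. If the top entries are all crawlers, it's bots rather than a crowd of real visitors.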