As part of my research for a new book I’m writing, I was digging around in my Ask Dave Taylor Web site just to see how the most recent Web browsers identify themselves. Much to my surprise, there are literally hundreds of different crawlers hitting the site now, over and above the usual 20-30 popular Web browsers. Many are crawlers I’ve never heard of, from sites (when they’re identified at all) that are equally unfamiliar to me.
Here’s a pile of different robots and crawlers I found in my log file, all visiting within a single 24-hour period:
- Amfibibot/0.06 (Amfibi Robot; http://www.amfibi.com)
- Baiduspider+(+http://www.baidu.com/search/spider.htm)
- BecomeBot/1.23; +http://www.become.com/webmasters.html)
- BecomeBot/2.0beta; +http://www.become.com/webmasters.html)
- blogsnowbot (+http://www.blogsnow.com/bot.html)
- boitho.com-dc/0.xx (http://www.boitho.com/dcbot.html)
- Enterprise_Search/1.00.143;MSSQL (http://www.innerprise.net/es-spider.asp)
- everyfeed-spider/1.0 (http://www.everyfeed.com)
- FAST Enterprise Crawler 6 (Experimental)
- HenryTheMiragoRobot (http://www.miragorobot.com/scripts/mrinfo.asp)
- HooWWWer/2.0.9 (+http://cosco.hiit.fi/search/hoowwwer/)
- Iltrovatore-Setaccio/1.2 (It-bot; http://www.iltrovatore.it/bot.html)
- msnbot/0.3 (+http://search.msn.com/msnbot.htm)
- NewzCrawler/1.7 (Newz Crawler
- NextopiaBOT (+http://www.nextopia.com)distributed crawler client beta
- NPBot (http://www.nameprotect.com/botinfo.html)
- NusEyeFeedCrawler/0.005 (cs.northwestern.edu);
- NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html)
- psbot/0.1 (+http://www.picsearch.com/bot.html)
- Spider-Sleek/2.0 (+http://search-info.com/linktous.html)
- SpurlBot/0.2)
- SurveyBot/2.3 (Whois Source)
- Trampel-Bot (www.trampelpfad.de)
- TutorGigBot/1.5 ( +http://www.tutorgig.info )
- Vagabondo/2.0 MT; http://aanmelden.ilse.nl/?aanmeld_mode=webhints)
- ZyBorg/1.0 ( http://www.WISEnutbot.com)
Thankfully, many of these are polite enough to include a URL where I can glean more information, but it’s a darn surprise how many there are!
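If you want to run the same kind of tally on your own server, here is a minimal sketch in Python. It assumes your log is in the common Apache "combined" format, where the user-agent is the last quoted field on each line; the sample lines, IP addresses, and the `count_user_agents` name are made up for illustration, so adjust the pattern if your server logs differently.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, which ends each line with
# "referer" "user-agent". Other log formats need a different pattern.
UA_PATTERN = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')


def count_user_agents(lines):
    """Count how many requests each user-agent string made."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:  # silently skip lines that do not fit the assumed format
            counts[match.group("agent")] += 1
    return counts


# Hypothetical sample lines; in practice, pass an open log file instead.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Feb/2005:10:00:00 -0700] "GET / HTTP/1.1" 200 512 '
    '"-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.2 - - [01/Feb/2005:10:00:05 -0700] "GET /feed HTTP/1.1" 200 900 '
    '"-" "psbot/0.1 (+http://www.picsearch.com/bot.html)"',
    '192.0.2.1 - - [01/Feb/2005:10:01:00 -0700] "GET /about HTTP/1.1" 200 300 '
    '"-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
    'this line is not a valid log entry',
]

agent_counts = count_user_agents(SAMPLE_LOG)

# List agents from most to least active.
for agent, hits in agent_counts.most_common():
    print(hits, agent)
```

To scan a real file, something like `count_user_agents(open("access.log"))` should work, since the function only needs an iterable of lines; the filename here is just a placeholder.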
Playing detective for a bit, I found some interesting sites visiting my server. One is BecomeBot, described as “the user-agent for Become’s new web crawler. Become is crawling the web to build a next generation search engine.” Another is TutorGig, which “lists thousands of courses. These courses include not only online courses, but also more traditional courses that are taught in person on or off campus. Users locate courses by searching on keywords of interest. TutorGig.com has a huge database of over a million tutorial sites categorized by more than 2000 subjects.”
Further, I’m sure that some of the crawlers that hit my site are spam tools. When a crawler identifies itself as larbin_2.6.3 larbin2.6.3@unspecified.mail, libwww-perl/5.76, gazz/5.0, Pluck Soap Client/1.0Program Shareware 1.0.2, HenryTheMiragoRobot, LPW::Simple, or one of my other favorites, Anonymized by Stegos Internet Anonymizer, ya just gotta wonder…
Anyone else being overrun by weird and suspicious bots?
I pull your site’s RSS every morning using Sunrise 0.36. I read most blogs on my Palm Tungsten at work during breaks, lunch, etc.
Re: the become.com spider
They’re launching Feb 10, 2005 and will have 2.2 billion pages in their index… All of which are related to shopping. They’ll be debuting a proprietary algorithm as well.
I have been getting hit with the obidos-bot. Ever heard of that one?
Some Google investigation reveals that the chap who owns this Web site — http://www.onfocus.com/ — is the author of obidos-bot. It also suggests that his ‘bot ignores the robots.txt file and ruleset, frustratingly.
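For anyone wondering what honoring robots.txt looks like from the crawler's side, here is a rough sketch using Python's standard `urllib.robotparser` module. The rules, the `example.com` URLs, and the use of `obidos-bot` as the robots.txt token are assumptions for illustration only. Keep in mind that robots.txt is purely advisory: a bot that never asks this question simply crawls whatever it likes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents that shut out one crawler by name.
# The "obidos-bot" token is assumed; check your logs for the exact
# string the bot actually sends.
ROBOTS_TXT = """\
User-agent: obidos-bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks this before every fetch and skips the URL on False.
blocked_bot_allowed = parser.can_fetch(
    "obidos-bot", "http://www.example.com/page.html"
)
other_bot_allowed = parser.can_fetch(
    "msnbot", "http://www.example.com/page.html"
)

print("obidos-bot allowed:", blocked_bot_allowed)
print("msnbot allowed:", other_bot_allowed)
```

If a crawler keeps showing up in the logs after a rule like this is in place, the only real recourse is blocking it at the server, since nothing forces it to read the file.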