Robots can cause increased load on the public catalog.
Crawlers try to follow every link embedded in catalog pages.
Crawlers send search requests at very high volume and speed.
Robots exclusion protocol: depends on voluntary cooperation on the part of search engine providers.
Read from the directory above "/screens"
Might be ignored by an ill-behaved crawler
This file is publicly viewable, but it is not directly controllable or configurable by the library.
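As a sketch of what the protocol looks like (illustrative only, not the exact file Innovative ships), a robots.txt that asks every crawler to stay away from the search pages might read:

    # Illustrative robots.txt - paths are examples, not Innovative's actual rules
    User-agent: *          # applies to every crawler
    Disallow: /search      # keep crawlers away from search/command links

A well-behaved crawler fetches this file before crawling and honors the Disallow lines; an ill-behaved one simply ignores it.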
Innovative’s Strategy with Robots.txt
Allow access to mainmenu.html
Give legitimate search engines a chance to index the catalog's main page
Update the robots.txt file with each software release
Extend blocking to new command links
Robots.txt allows Googlebot so that Google Scholar can index the catalog; Googlebot may crawl both / and /screens (a sketch follows).
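A hedged sketch of how such per-crawler rules are expressed (the paths and section contents are illustrative, not a copy of Innovative's file):

    # Googlebot gets its own section so Google Scholar can index the catalog
    User-agent: Googlebot
    Disallow:              # an empty Disallow means nothing is blocked

    # All other crawlers are kept out of command links
    User-agent: *
    Disallow: /search      # example command-link path

Rules are grouped by User-agent, so Googlebot matches its own section and skips the general one.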
Recognizing a problem with crawlers
Numerous searches submitted from an "outside" IP address
In a very short time span
In systematic patterns not typical of human users (a detection sketch follows this list)
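A minimal detection sketch in Python, assuming a plain access log with the client IP as the first whitespace-separated field (the filename, field layout, and threshold are all assumptions, not Innovative's format):

    from collections import Counter

    THRESHOLD = 500  # searches per log window; illustrative value, tune to local traffic

    counts = Counter()
    with open("webpac_access.log") as log:      # hypothetical filename
        for line in log:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1          # assume the client IP is the first field

    # Addresses far above the threshold are crawler candidates
    for ip, n in counts.most_common(10):
        if n > THRESHOLD:
            print(ip, n)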
Check "non-local access attempts allowed" through the character-based interface.
If an IP lookup on an address returns something suspicious, add an entry to the HTTP access table and set the access value to "no" (a lookup sketch follows).
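One way to do the lookup, sketched in Python: a legitimate search-engine crawler usually reverse-resolves into its operator's domain (Googlebot addresses resolve into googlebot.com, for example), while an address with no sensible reverse DNS entry is a candidate for a "no" entry:

    import socket

    def reverse_lookup(ip):
        """Return the hostname for an IP, or None if it has no reverse DNS entry."""
        try:
            return socket.gethostbyaddr(ip)[0]
        except (socket.herror, socket.gaierror):
            return None

    host = reverse_lookup("192.0.2.10")   # documentation-range address, illustrative only
    print(host or "no reverse DNS entry - worth a closer look")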
Usage analysis in Release 2007
Layer in front of WebPAC
Logging of search activity
Available as a zip file one day later
Downloadable for analysis with third-party tools
These logs are maintained for 30 days
Retrievals of "robots.txt" by well-behaved crawlers are posted to this log.
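Because the logs arrive as a zip file, a short script can serve as the third-party tool; this sketch (zip name and log layout are assumptions) counts which clients retrieved robots.txt:

    import zipfile
    from collections import Counter

    fetchers = Counter()
    with zipfile.ZipFile("search_log.zip") as z:        # hypothetical filename
        for name in z.namelist():
            with z.open(name) as log:
                for raw in log:
                    line = raw.decode("utf-8", errors="replace")
                    if "robots.txt" in line:
                        fetchers[line.split()[0]] += 1  # assume client IP is first field

    for ip, n in fetchers.most_common():
        print(ip, n)

Clients that fetch robots.txt and then respect it are the well-behaved ones; heavy searchers that never fetched it at all stand out.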
Searching for Apache in the list of processes
Restart through the terminal menu
Show all, Limit by httpd
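Where shell access is available, the same check can be approximated outside the menu; a minimal Python sketch that lists all processes and limits the output to httpd:

    import subprocess

    # "ps -ef" lists every process; keeping only the httpd lines mirrors the
    # "Show all, Limit by httpd" step described above.
    ps = subprocess.run(["ps", "-ef"], capture_output=True, text=True)
    for line in ps.stdout.splitlines():
        if "httpd" in line:
            print(line)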