IUG 2009 – Robots-Crawlers-Spiders – Automated Searches and Your WebPAC

Mark Welge

Innovative Interfaces


Robots can cause increased load on the public catalog.

The crawler tries to follow every link embedded in catalog pages.

Crawlers send search requests at very high volume and speed.


Robots exclusion protocol – depends on voluntary cooperation on the part of search engine providers.



Read from directory above "/screens"

Publicly viewable

Might be ignored by an ill-behaved crawler


The robots.txt file is publicly viewable, but it is not directly controllable or configurable by the library.


Innovative’s Strategy with Robots.txt

Allow access to mainmenu.html

Give legitimate search engines a chance to index the main page of catalog

Update robots.txt file with software releases

Extend blocking to new command links




Robots.txt allows Googlebot for Google Scholar. This allows crawling of both / and /screens.
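To make the strategy concrete, here is a minimal sketch of how a well-behaved crawler would interpret rules of this kind, using Python's standard urllib.robotparser. The rules in HYPOTHETICAL_RULES and the /search/ path are assumptions made up for illustration; they are not the robots.txt that Innovative actually ships.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, NOT Innovative's actual file.
# Googlebot may crawl everything; all other robots may fetch only the main menu.
HYPOTHETICAL_RULES = """\
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /screens/mainmenu.html
Disallow: /
"""

parser = RobotFileParser()
parser.parse(HYPOTHETICAL_RULES.splitlines())

# A cooperative crawler asks before fetching; an ill-behaved one simply skips this step.
googlebot_search = parser.can_fetch("Googlebot", "/search/Xcats")      # hypothetical search URL
other_mainmenu = parser.can_fetch("SomeBot", "/screens/mainmenu.html")
other_search = parser.can_fetch("SomeBot", "/search/Xcats")

print(googlebot_search, other_mainmenu, other_search)
```

Python's parser applies the first matching rule in a group, so the Allow line for mainmenu.html is listed before the blanket Disallow. Other crawlers may resolve conflicting rules differently.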


Recognizing a problem with crawlers

System slowness

Numerous searches submitted from an "outside" IP address

In a very short time span

In systematic patterns not typical of human users
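The symptoms above can be checked mechanically once request timestamps and IP addresses are available. The following is a rough sketch, not a feature of the WebPAC: the function name find_bursty_ips, the 60-second window, and the 30-request threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque

def find_bursty_ips(events, window_seconds=60, max_requests=30):
    """Return the IPs that sent more than max_requests within any window.

    events is an iterable of (timestamp_in_seconds, ip_address) pairs.
    The thresholds are illustrative assumptions, not recommended values.
    """
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    flagged = set()
    for timestamp, ip in sorted(events):
        window = recent[ip]
        window.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged

# Example: one address sends a request every second, another every 100 seconds.
sample = [(t, "203.0.113.5") for t in range(50)]
sample += [(t * 100, "198.51.100.7") for t in range(5)]
bursty = find_bursty_ips(sample)
```

A human user rarely sustains more than a handful of searches per minute, so a threshold well above that rate should keep false positives low, though the right value depends on local traffic.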


Check "non-local access attempts allowed" through the character-based interface.




If an IP lookup on an address returns something suspicious, add an entry to the http access table and set the access value to no.
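An IP lookup of this kind can be scripted with a reverse DNS query. The sketch below uses Python's standard socket.gethostbyaddr; the helper name reverse_lookup and its resolver parameter are assumptions added so the function can be exercised without network access. It only reports the hostname and does not touch the http access table.

```python
import socket

def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname registered for ip, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None

# Usage against live DNS (result depends on the network):
#   print(reverse_lookup("203.0.113.5"))
```

A hostname that belongs to a known search engine suggests a legitimate crawler. No hostname at all, or an unfamiliar one, is a reason to look more closely before blocking the address.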


Usage analysis in Release 2007

Apache server

Layer in front of WebPAC

Logging of search activity

Available as a zip file 1 day later

Downloadable for analysis with 3rd-party tools

These logs are maintained for 30 days


Retrieval of "robots.txt" by well-behaved crawlers will be posted to this log.
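Once a log has been downloaded and unzipped, the robots.txt retrievals can be tallied per address to see which crawlers at least asked for the rules. This is a sketch that assumes the log lines follow Apache's Common Log Format; the actual format of the WebPAC logs is not stated in the session, and the sample lines below are invented.

```python
import re
from collections import Counter

# Assumes Common Log Format: ip ident user [time] "METHOD path protocol" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)

def robots_txt_fetchers(lines):
    """Count robots.txt requests per client IP address."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("ip")] += 1
    return counts

# Invented sample lines for illustration.
sample_log = [
    '192.0.2.10 - - [04/May/2009:10:00:00 -0700] "GET /robots.txt HTTP/1.1" 200 312',
    '192.0.2.10 - - [04/May/2009:10:00:01 -0700] "GET /screens/mainmenu.html HTTP/1.1" 200 5120',
    '203.0.113.5 - - [04/May/2009:10:00:02 -0700] "GET /search/Xcats HTTP/1.1" 200 9000',
]
fetchers = robots_txt_fetchers(sample_log)
```

An address that generates heavy search traffic but never appears in this tally is a likely candidate for an ill-behaved crawler.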


Searching for apache in the list of processes

Restart terminal menu

Show all, Limit by httpd
