IUG 2009 – Robots-Crawlers-Spiders – Automated Searches and Your WebPAC

Posted: May 20, 2009 in conferences, Innovative Interfaces, Innovative Users Group, integrated library system, IUG, iug 2009, library catalog, search engines
Mark Welge

Innovative Interfaces

 

Robots can cause increased load on the public catalog.

The crawler tries to follow every link embedded in catalog pages.

The crawlers send search requests at very high volume and speed.

 

Robots exclusion protocol – depends on voluntary cooperation on the part of search engine providers.

 

Robots.txt

Read from directory above "/screens"

Publicly viewable

Might be ignored by an ill-behaved crawler

 

This file is publicly viewable, but it is not directly controllable or configurable by the library.

 

Innovative’s Strategy with Robots.txt

Allow access to mainmenu.html

Give legitimate search engines a chance to index the main page of catalog

Update robots.txt file with software releases

Extend blocking to new command links

 

http://my.library.edu/robots.txt

 

Robots.txt allows Googlebot for Google Scholar. This allows crawling of both / and /screens.
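
To make the idea concrete, here is a rough sketch of how a well-behaved crawler would interpret rules of this kind, using Python's standard urllib.robotparser. The rules below are hypothetical and written only for illustration; they are not Innovative's actual robots.txt, and the /screens/mainmenu.html and /search/ paths are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Innovative's actual file.
# Googlebot may crawl everything; all other robots may fetch only the
# main menu page (path assumed to be /screens/mainmenu.html).
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Allow: /screens/mainmenu.html
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(user_agent, path):
    """Return True if a cooperating crawler with this user agent may fetch path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/search/X?dogs"),
        ("SomeOtherBot", "/screens/mainmenu.html"),
        ("SomeOtherBot", "/search/X?dogs"),
    ]:
        print(agent, path, is_allowed(agent, path))
```

This only models what a cooperating crawler would do. An ill-behaved crawler never consults the file, so the rules have no effect on it.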

 

Recognizing a problem with crawlers

System slowness

Numerous searches submitted from an "outside" IP address

In a very short time span

In systematic patterns not typical of human users
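
The "many searches in a very short time span" symptom can be checked mechanically once request times are available. The following is a minimal sketch, assuming the requests have already been extracted as (ip, timestamp-in-seconds) pairs; the function name, the window size, and the threshold are all illustrative choices, not values from the session.

```python
from collections import defaultdict, deque


def find_bursty_ips(requests, window_seconds=60, threshold=100):
    """Return the set of IPs that made at least `threshold` requests
    within any span of `window_seconds` seconds.

    `requests` is an iterable of (ip, timestamp) pairs, with timestamps
    given as seconds (for example, Unix epoch seconds).
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ip, ts in sorted(requests, key=lambda r: r[1]):
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            flagged.add(ip)
    return flagged


if __name__ == "__main__":
    sample = [("192.0.2.10", t) for t in range(5)]           # 5 hits in 5 seconds
    sample += [("198.51.100.7", 0), ("198.51.100.7", 500)]   # 2 hits, far apart
    print(find_bursty_ips(sample, window_seconds=10, threshold=5))
```

A count like this only flags candidates. Deciding whether an address is really a robot still takes the kind of IP lookup and judgment described in these notes.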

 

Check "non-local access attempts allowed" through the character-based interface.

 

http://www.hostip.info

 

If an IP lookup on an address returns something suspicious, add an entry to the http access table and set the access value to no.

 

Usage analysis in Release 2007

Apache server

Layer in front of WebPAC

Logging of search activity

Available as a zip file 1 day later

Downloadable for analysis with 3rd-party tools

These logs are maintained for 30 days

 

Retrieval of "robots.txt" by well-behaved crawlers will be posted to this log.
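
Since well-behaved crawlers show up in the log when they fetch robots.txt, a downloaded log can be scanned for those requests. This is only a sketch: it assumes the log lines follow the Apache Common Log Format, which the session did not confirm, and the function name and sample lines are made up for illustration.

```python
import re

# Assumes Apache Common Log Format lines such as:
# 192.0.2.10 - - [20/May/2009:10:00:00 -0700] "GET /robots.txt HTTP/1.1" 200 512
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def robots_txt_fetchers(lines):
    """Return the set of client IPs that requested /robots.txt."""
    fetchers = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        # Ignore any query string when comparing the requested path.
        if match.group("path").split("?")[0] == "/robots.txt":
            fetchers.add(match.group("ip"))
    return fetchers


if __name__ == "__main__":
    sample_log = [
        '192.0.2.10 - - [20/May/2009:10:00:00 -0700] "GET /robots.txt HTTP/1.1" 200 512',
        '198.51.100.7 - - [20/May/2009:10:00:01 -0700] "GET /search/X?dogs HTTP/1.1" 200 9000',
    ]
    print(robots_txt_fetchers(sample_log))
```

An address that sends many searches but never appears in this set is a plausible candidate for an ill-behaved crawler, though absence alone proves nothing.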

 

Searching for Apache in the list of processes

Restart terminal menu

Show all, Limit by httpd
