IUG 2009 – Robots-Crawlers-Spiders – Automated Searches and Your WebPAC

Mark Welge

Innovative Interfaces

 

Robots can cause increased load on the public catalog.

The crawler tries to follow every link embedded in catalog pages.

The crawlers send search requests at very high volume and speed.

 

Robots exclusion protocol – depends on voluntary cooperation on the part of search engine providers.

 

Robots.txt

Read from directory above "/screens"

Publicly viewable

Might be ignored by an ill-behaved crawler

 

This file is publicly viewable, but it is not directly controllable or configurable by the library.

 

Innovative’s Strategy with Robots.txt

Allow access to mainmenu.html

Give legitimate search engines a chance to index the main page of the catalog

Update robots.txt file with software releases

Extend blocking to new command links

 

http://my.library.edu/robots.txt

 

Robots.txt allows Googlebot for Google Scholar. This allows crawling of both / and /screens.
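As a rough illustration, a robots.txt following the strategy above might look something like this. This is a hypothetical sketch, not Innovative's actual file; the blocked paths are invented for illustration.

```
# Hypothetical sketch, not Innovative's actual robots.txt.
# Paths below are invented for illustration.

# Let Googlebot (including Google Scholar) crawl the catalog
User-agent: Googlebot
Disallow:

# Block everyone else from command links, but allow the main page
User-agent: *
Allow: /screens/mainmenu.html
Disallow: /search
Disallow: /patroninfo
```

Note that `Allow` is a nonstandard extension to the robots exclusion protocol; major crawlers honor it, but an ill-behaved crawler can ignore the whole file.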

 

Recognizing a problem with crawlers

System slowness

Numerous searches submitted from an "outside" IP address

In a very short time span

In systematic patterns not typical of human users

 

Check "non-local access attempts allowed" through the character-based interface.

 

http://www.hostip.info

 

If an IP lookup on an address returns something suspicious, add an entry to the http access table and set the access value to "no".

 

Usage analysis in Release 2007

Apache server

Layer in front of WebPAC

Logging of search activity

Available as a zip file 1 day later

Downloadable for analysis with 3rd-party tools

These logs are maintained for 30 days

 

Retrieval of "robots.txt" by well-behaved crawlers will be posted to this log.
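Since the logs are downloadable, a small script can surface the high-volume, systematic request patterns described above. This is a generic sketch that assumes Apache common-log-style lines (client IP as the first token); it is not tied to Innovative's actual log layout, and the sample entries are invented.

```python
import re
from collections import Counter

# Invented sample lines in Apache common-log style (IP address first).
SAMPLE_LOG = """\
66.249.65.1 - - [24/May/2009:10:00:01 -0500] "GET /robots.txt HTTP/1.1" 200 312
66.249.65.1 - - [24/May/2009:10:00:02 -0500] "GET /search~S1/?searchtype=t HTTP/1.1" 200 5120
66.249.65.1 - - [24/May/2009:10:00:03 -0500] "GET /search~S1/?searchtype=a HTTP/1.1" 200 4980
192.168.0.10 - - [24/May/2009:10:05:00 -0500] "GET /search~S1/?searchtype=t HTTP/1.1" 200 5120
"""

def requests_per_ip(log_text):
    """Count requests per client IP from Apache-style log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = re.match(r"^(\S+) ", line)
        if match:
            counts[match.group(1)] += 1
    return counts

counts = requests_per_ip(SAMPLE_LOG)
# An outside IP submitting a burst of rapid searches stands out at the top.
for ip, n in counts.most_common():
    print(ip, n)
```

A third-party log analyzer does the same thing with more polish; the point is only that a per-IP count over a short time window makes a crawler easy to spot.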

 

Searching for apache in the list of processes

Restart terminal menu

Show all, Limit by httpd


IUG 2009 – Millennium and XML – Repurposing and Customizing Catalog Metadata

Dao Rong Gong

Lucas Mak

Michigan State University

 

As a quick note to self, this looks like it could be very useful for a pending project I have in mind. Can’t wait to get my hands on the conference presentation handout.

 

Innovative uses its own type of XML data. This can be retrieved through HTTP queries.

 

The data arrangement is based on MARC fields, but MARC fields and their subfields are siblings.

Two types of XML records can be retrieved from Millennium: brief records and full records

 

The Millennium System and XML

Encore has built-in functionality that allows it to harvest OAI-compliant services.

 

XSLT

Manipulation of XML documents by creating a new document based on the original

 

XSLT uses XPath expressions to select/filter the data nodes
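For example, an XSLT template might use XPath predicates to pull a value out of the sibling-style arrangement described above. This is a minimal hypothetical sketch; the element names (record, field, subfield) are invented for illustration, not Innovative's actual schema.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: element names are invented, not Innovative's
     actual schema. Pulls a 245 $a value into dc:title. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:template match="/record">
    <metadata>
      <dc:title>
        <!-- Select a subfield whose nearest preceding field sibling is 245 -->
        <xsl:value-of
          select="subfield[@code='a'][preceding-sibling::field[1]/@tag='245']"/>
      </dc:title>
    </metadata>
  </xsl:template>
</xsl:stylesheet>
```

The `preceding-sibling` predicate is exactly the kind of extra matching that sibling-structured data forces on the stylesheet author.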

 

Sunday School books collection

Needed to batch load records into Content Pro. Original data source is based on MARC.

 

One option is to create a list of records as a review file. Records could then be converted to Qualified Dublin Core using MarcEdit.

 

Used an HTTP query to request the Innovative XML. Then turned that into Qualified Dublin Core with XSLT.

 

Issues with Converting Innovative XML Data

 

Data is structured differently from MARC21 XML

Availability of existing "Innovative XML to DC/QDC" XSLT?

 

Not optimized for data manipulation

Complications in data selection

Selection of a data node by matching criteria against values in individual elements

A series of matches may be needed just to be able to select one node
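To make that "series of matches" concrete, here is a small Python sketch, using only the standard library, that selects subfield values from a sibling-style arrangement like the one described above. The element names and sample data are invented for illustration, not Innovative's actual schema.

```python
import xml.etree.ElementTree as ET

# Invented sample: fields and subfields are siblings, so a subfield's
# parent field can only be identified by document order.
SAMPLE = """\
<record>
  <field tag="245"/>
  <subfield code="a">Sunday school hymnal</subfield>
  <subfield code="c">edited by A. Author</subfield>
  <field tag="260"/>
  <subfield code="b">Some Publisher</subfield>
</record>
"""

def subfields_for(root, tag, code):
    """Collect subfield values that follow a given field element,
    stopping at the next field -- necessary because fields and
    subfields are siblings rather than parent and child."""
    values = []
    in_target = False
    for child in root:
        if child.tag == "field":
            in_target = (child.get("tag") == tag)
        elif child.tag == "subfield" and in_target and child.get("code") == code:
            values.append(child.text)
    return values

root = ET.fromstring(SAMPLE)
print(subfields_for(root, "245", "a"))  # -> ['Sunday school hymnal']
```

With a parent/child arrangement (as in MARC21 XML) this would be a single path expression; the sibling arrangement forces the stateful walk above.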

IUG 2009 – Encore Implementation – One Academic Library’s Experience

Christopher Brown, Elizabeth Meagher, Sandra Macke

Penrose Library, University of Denver

 

This library didn’t do any advance publicity before their Encore deployment. They just turned it on and let it run. There was only one complaint, from a single faculty member.

 

With this approach they felt that the system was intuitive enough that patrons would just "get it," and that has proven to be the case.

 

Professors can use community tagging as a way to create virtual reserves.

 

The library selected keyword searching as the starting point for several reasons.

More Google-like

Works quite well for known-item searching

Is much faster than the regular catalog

Has no limits on result sets (the traditional OPAC maxes out at 32,000 results)

 

During the implementation, they found that serials appear at the top of the results sets. This library considers this an added bonus to Encore.

 

The library hopes to make their catalog the ultimate reference tool for their institution. Along with this, they have a goal of making searching the OPAC at least as easy as browsing the reference shelves.

 

The library currently has about 250 tags created through the community tagging system.

 

The library rolled out what was a fairly canned implementation of Encore. After rolling out, they began revising the catalog based on input rolling in from students and reference librarians.

 

Encore also helped reveal things relevant to database maintenance. For example, when a feature film showed up with a 3-D object facet, this threw up a red flag that something was wrong. When you’re skimming through records and facets, you can look for the things where only one item appears. Surprisingly, this helps problems bubble to the surface so that they can be found and corrected.

 

An audience question led to a reiteration of the point that the Penrose Library just put this out there. They bought it, turned it on, and let the public have at it. From Christopher: "Customers today understand that. Google does this all the time. They put something out there, and if it breaks, it breaks. We did it that way, and I would do it again."

 

"The first day out of the box it’s pretty good. Then you just tweak it."

 

Encore cannot handle all of the JavaScript enhancements that are possible in the traditional catalog.

Top Tech Trends – Denver, pt. 2

ALA Midwinter 2009 Trendsters: Marshall Breeding, Karen Coombs, Roy Tennant, Clifford Lynch, Karen Schneider, Karen Coyle

 

Karen Coyle – A lot of what’s happening is not new technology but issues around management of technology

 

Karen Schneider – Recapturing tools creation. 80s-90s – dark ages where other people were creating the tools for us.

 

Clifford Lynch – Flickr commons. Library of Congress and New York Public putting photos online. Some people are looking at ways to re-import this information into their own databases.

 

Question – Is there anything that has been the proof of the pudding that librarians can build and maintain our own tools?

 

Karen Schneider – The test for open-source software seems to be whether it can move past the founding library or founding community. The verdict is still out on whether it can be successful in the long run.

 

Karen Coyle – If software is not allowed to fork in different directions, we’re locked into the same old model where everyone is doing exactly the same thing.

 

Forking (def.) – when a project divides significantly enough so that there is no one thing that people refer to as the core code.

 

Roy Tennant – Flickr Commons – We need to find ways to feed that information back into our systems more easily. Catalogers trying to feed that information back into our systems is not going to scale.

 

Clifford Lynch – People went to Flickr because it was there and it had a user base. What is significant is that it builds bridges between existing stores of knowledge.

 

Clifford Lynch – Widespread markup of biographical and historical narratives.

 

Karen Coyle – With the ubiquity of global positioning, information is going to be more location contextual.

 

Marshall Breeding – It’s going to take a while to get there.

 

Karen Coombs – There is a point at which GPS just isn’t good enough. Users need help finding items even within the building.

 

Clifford Lynch – GPS has largely been used for driving directions or missile strikes. There is a whole set of technologies that can be used to narrow this down much more. Now that GPS is moving ubiquitously into cell phones, we’ll see a second generation of spatial applications.

 

Marshall Breeding – We’re already getting location-targeted information. When we surf the web in a new city, we get location-targeted ads.

 

Karen Coombs – Geographical-based services. Too many locations are looking at IP address or asking users to input a zip code. Systems need to consider that where you are physically doesn’t necessarily have anything to do with your affiliation.

 

Karen Coombs – Google Scholar lets you set institutions with which you are affiliated.

 

Karen Coyle – OpenStreetMap for libraries. People are walking around with GPS units and replicating Google Street View with an open

 

Roy Tennant – People putting data on the web through stable URIs. We’re looking at putting data out. It will be interesting to see what kind of linkages people make with that data.

 

Marshall Breeding – What are some examples?

 

Roy Tennant – We don’t know yet, and that’s the interesting part. What will people find to do with it?

 

Clifford Lynch – In scientific communities people

 

Roy Tennant – Small slice of a particular discipline.

 

Question from audience – Does the new ORE standard have implications for this?

 

Karen Coyle – Data elements have to be on the web.

 

Clifford Lynch – ORE is really intended to allow you to work with objects or groups of objects rather than the metadata about those objects. It’s built to be consistent with semantic web standards.

 

Karen Coombs – ORE is good for moving the objects themselves.

 

Karen Coyle – We have the amoeba form of linked data in hypertext. But all we have is a link that doesn’t tell you anything about what it means, and it’s only one-way. How do we get the links to be meaningful?

 

Karen Coombs – We code HTML in the simplest way possible and don’t use it to its full potential.

 

Karen Schneider – I think I’m seeing some controlled burn in libraries due to economic pressures. They’re having to make hard decisions that they would not otherwise have had to make. Public libraries have never had higher traffic but they’ve never had such economic pressures.

 

Karen Coyle – Public libraries circulating 3-4 times their collection every year can make a good argument for RFID. Maybe more difficult for academics.

 

Karen Schneider – If you were opening a new library tomorrow, you’d have to think about RFID and self-checkout.

 

Karen Coyle – Most libraries in the study made the switch to RFID when opening a new branch or doing a renovation.

 

Karen Coombs – How many ILL requests do people cancel because you have it already or because you don’t loan textbooks? We have to work smarter so we’re

 

Karen Schneider – How about RFID for item location in the stacks.

 

Karen Schneider – One vendor using advanced shipping notices for acquisitions. ASN is used ubiquitously in the commercial book world. Almost unknown in libraries.

 

Marshall Breeding – We’re concerned about processes and our control of material – not just how to fulfill user needs. We need to find a way to get that one-click user satisfaction.

 

Karen Coombs – Books have to go to cataloging and then to shelves or reserve. It would make patrons much happier if it went directly to faculty.

 

Karen Coyle – RFID in public libraries for self-check – much faster. Libraries that have a high level of self-check also circulate a high level of self-help materials, since those materials don’t have to pass through a staff member. More privacy.

 

Audience comment – No lines for check-out, but longer lines for check-in because the automated technology can’t keep up.

 

Karen Schneider – Brisbane, Australia – Amazing city library that is completely self-check. You can also watch robots check in materials. It takes something mundane and makes it fun and entertaining. Humans are used intelligently for error handling, letting automation do what it does well.

 

Karen Schneider – You don’t want to tie people to routine, mundane tasks when they could be roaming around helping users.

 

Karen Schneider – There is one library that uses a biometric station for patrons who have forgotten their library cards.

 

Karen Coombs – We have to think carefully about our processes and apply cost-effective solutions. How many times does someone from systems have to work on a malfunctioning piece of hardware before we just replace it?

 

Karen Schneider – Total neglect of getting good bandwidth to the extreme ends of rural areas. Very forward-thinking rural libraries are hampered by limited bandwidth. It’s not a money problem, it’s an end-of-the-road problem.

 

Karen Coombs – Utility companies (cable, cell, etc.) think it’s not cost effective to provide services in some areas.

 

Clifford Lynch – this is a public policy problem.

 

Marshall Breeding – The lack of bandwidth to rural libraries has an impact on how they automate. Can they do resource sharing? Can they participate in consortia?

 

Audience comment – Large new Gates program addressing rural telecommunications.

 

Karen Schneider – That’s wonderful, but it’s going to be a drop in the bucket.

 

Karen Coombs – Technology is like a ravenous puppy running around eating the whole house. If libraries can’t get funding to continuously replace equipment, it quickly goes back to being bad.

 

Marshall Breeding – WiMax is supposed to solve some of the bandwidth problems. It just hasn’t solved the problems.

 

Karen Coombs – Some rural success stories come from municipalities that have partnered to provide higher bandwidth to residents.

 

Karen Coyle – Open and closed models of sharing data. Closed models are easy to understand. Open allows innovation, but it’s harder to understand the business model. I hope we’re beginning to understand the difference in databases and the web as our data platform.

There are a number of people trying to use technology to solve rights questions.

 

Karen Schneider – The death of print publishing. It’s on life support. We’re seeing the death of paper with newspapers and magazines. For those of us who have been publishing in the traditional paper world, this is very serious.

We’re starting to see sensible measurements of the carbon footprint in data centers.

 

Marshall Breeding – I fly only on plug-in hybrid planes!

 

Clifford Lynch – Newspapers seem to be melting down economically

Newspapers have ramifications for community building and community definition. If these move only to the web, the question of how they’re archived changes in a radical way. The way people interact with displays is beginning to change. New generations of technology – e-ink; desktops with multiple monitors are commonplace.

Libraries are still locked into single-screen setups.

Recent study about how higher ed costs have changed. It argues that all of the cost increases have gone into administration and overhead rather than teaching. The data looks strange because technology is lumped under overhead.

Evidence based studies about how technology enhances teaching and learning.

 

Roy Tennant – I don’t see the book publishing industry melting down.

There are new ways to publish that were not available before.

 

Clifford Lynch – Books – Distribution of what’s being published is changing. Authors are getting different options.

If libraries want to collect books, it’s no longer adequate to just look at what’s coming out of traditional publishing.

 

Karen Schneider – Book publishing is in serious trouble.

 

Roy Tennant – More important to focus on making good technology decisions.

How do we decide when to jump in? How do we decide when to get out?

 

Karen Coombs – What it takes to do true digital preservation – It’s very scary. Collections we rely on that other people curate. I don’t have a lot of confidence.

 

Clifford Lynch – The stuff that is already digital is probably in better shape than other things.

 

Karen Coombs – Some of the smaller journals – if they can’t get their content on the web, then I don’t trust their preservation.

 

Marshall Breeding – I worry about libraries not doing long-term digital preservation. Local libraries don’t necessarily have the resources to do that.

This is not something that every library needs to reinvent. There are a lot of local installations.

Discovery interfaces. Much work is being done on these be-all, end-all solutions. Looking for better ways to expose library collections and services.

An urgency for libraries to present a better front end to our users, but we are sluggish about doing it. We’re taking our usual slow-and-cautious, wait-until-it’s-perfect approach.

Taking user-supplied content and improving it through web 2.0 features.

LibraryThing for Libraries being distributed through Bowker.

Open source companies – Open source is getting good, but not great reviews. Maybe some growing pains as software matures.

 

Clifford Lynch – If you’re a smaller scale library (smaller than national or major research)

We need to do a better job on collaborative arrangements, external services that smaller institutions can acquire.

Smaller libraries often simply cannot afford substantial preservation programs on their own. This is an incredibly hard problem because nobody wants to fund this stuff.

 

Marshall Breeding – It has to be done as a collaborative effort. It’s simply too big and too expensive to be done library by library.

Talking Tags

We’re currently in the midst of a library catalog redesign. Last week we had an open discussion forum to look at some of the new features that are available in the new release as well as some optional enhancements. On the enhancements side we looked at a book jacket service as well as LibraryThing. LibraryThing offers a set of enhancements which include tags, similar books, and other editions. I personally find the similar books feature to be very useful, and of course tag clouds are beginning to show up in more places.

LibraryThing sparked several interesting discussions. Two of the discussion points focused on issues that I’ve heard a number of times when discussing tagging. One point emphasizes the detail and specificity of Library of Congress Subject Headings. The other point highlights the ability of keyword searches to retrieve content that users need. If we already have subject headings and keyword searches, why do we need tags?

I think this is a valid question which deserves an answer, but perhaps not necessarily the answer one might expect. I don’t think of tags as a replacement for subject headings or keyword searches. Instead, the tags provide a function that goes directly to the core of web 2.0 technologies. Tags allow users to organize and interact with content in a way that is meaningful to them. Tags may also help users find books in the catalog, and it’s great if that happens. But I think it’s more significant that tags allow users to truly work with the content contained in the library catalog rather than just passively reading a screen and perhaps jotting down a call number.

In the end I don’t know if we’ll add LibraryThing enhancements to our catalog, but it’s definitely worth considering. The product has a lot of promise, and it sparked some interesting discussion during our forum. Many people saw the immediate value that these enhancements could add to the library catalog. Perhaps most importantly, we had a number of students present for our discussion, and THEY saw the benefits these enhancements would bring. That’s what it’s really all about, after all.