Deep Web Indexing

I came across an interesting New York Times article several days ago: Exploring a ‘Deep Web’ That Google Can’t Grasp. The article explores a shortcoming of current search technologies that librarians have known about and struggled with for quite some time. As good as current search engines may be, they rely primarily on crawlers or spiders that essentially trace a web of links to their ends. That works for a lot of content out on the Internet, but it doesn’t do so well for information contained in databases. So . . . library catalogs, digital library collections, a lot of the things that libraries do aren’t being picked up by the major search engines.

Of course, at some level that makes perfect sense. When a web crawler comes to a page with a search box, how is it supposed to know what to do? It needs to input search terms to retrieve search results, but what search terms are appropriate? Is it searching an online shopping website? A tech support knowledge base? A library catalog? This discussion surfaces again and again, particularly when we talk about one of our digital collections. There is a wealth of information here for people researching the history of accounting, but it resides in a database. The database works perfectly well for humans doing a search. The only problem is that they have to find out about the database first. Now we’ve done a number of things to get the word out: papers, conference presentations, a Wikipedia article . . . If we’re lucky, these things will get users to the top level of the collection. Hopefully once they’re there, their research will draw them in. (In case anyone notices, I should get credit for positioning that set of homonyms like that!)
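The crawler’s dilemma can be seen in a minimal sketch (hypothetical code, not any real search engine’s): a link-tracing crawler happily harvests every `<a href>` it finds, but when it reaches a `<form>` search box it has no terms to submit, so everything behind the form stays invisible to it.

```python
# Minimal sketch of why link-tracing crawlers miss database content:
# they can follow <a href> links, but a <form> search box is a dead end
# because the crawler has no idea what search terms to type into it.
from html.parser import HTMLParser

class LinkCrawler(HTMLParser):
    """Collects followable links; merely records forms it cannot search."""
    def __init__(self):
        super().__init__()
        self.links = []   # pages the crawler can visit next
        self.forms = []   # search boxes it has no terms for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            # The crawler sees the form, but which terms should it submit?
            self.forms.append(attrs.get("action", ""))

page = """
<a href="/about.html">About the collection</a>
<form action="/search"><input name="q"></form>
"""
crawler = LinkCrawler()
crawler.feed(page)
print(crawler.links)  # ['/about.html'] -- crawlable
print(crawler.forms)  # ['/search'] -- invisible without search terms
```

Everything reachable through links gets indexed; everything behind the `/search` form (the deep web) does not, which is exactly the gap deep web indexing tools aim to close.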

But getting them there in the first place – that’s the hard part. That’s why I have so much hope for deep web indexing. If researchers can build tools that will look into our databases intelligently, then extensive new levels of content will be opened up to everyone. In particular I think about students who decide that the first few search engine hits are “good enough” for their school project. Usually they’re not good enough, but the students don’t always realize that. If new search engines can truly open up the deep web, the whole playing field changes!

The Latest on CONTENTdm: New Capabilities, New Possibilities

Excerpts from OCLC’s presentation. Their full presentation will be posted online after the conference.

CONTENTdm 5 will use Webalizer for reports.

CONTENTdm will be added to the FirstSearch Base Package and will include:

- Full-function CONTENTdm hosted by OCLC
- 3 Project Clients for collection building (items may also be added with the simple web add form or Connexion digital import)
- 3,000-item limit and 10 GB of storage
- Available May 1, 2009

Digital Collection Gateway

- Improve access and presence for digital collections
- Synchronize non-MARC metadata with WorldCat

CONTENTdm 5

1. Unicode

- Full support for Unicode when importing, storing, displaying, and searching
- OCR support expanded to 184 languages
- Supports the creation of digital collections in any language

2. Find Search Engine (used for WorldCat)

- Find search engine integrated into the CONTENTdm software
- More robust capability and the ability to offer additional search features:
  - Relevancy
  - Faceted searching
  - Spelling suggestions
  - Unicode searching
  - Search in any language

3. Controlled Vocabularies

- Adds efficiency to collection building by providing pre-loaded thesauri for cataloging
- Integration with the OCLC Terminologies Service, providing nine new thesauri for CONTENTdm users

4. Reports

- More robust, scalable reporting module integrated into the software
- Provides expanded reports
- Views by collection and item

5. Flexible workflows

- More options for approving and indexing items
- New batch and subset handling of pending items
- One-click approve and index on demand
- Scheduling options for approve and index
- Background processing

6. Registration

- New registration process added during installation
- One click sends server information to OCLC
- Registered servers are called once a month to gather usage data

7. Project Client

- New client application replaces the old version
- New programming language
- New, more intuitive interface
- Unicode support
- More robust
- Project Settings Manager – Metadata Templates
  - Different templates for different file types
    - Images: JPEG, JPEG2000, TIF
    - PDF, compound objects, URL, audio, video
  - Options for generating data from different file types
    - Images: colorspace, bits per sample
    - PDF: extracts content from embedded fields (application, author, date modified, date created, etc.)

8. File Transfer

- Replaced FTP with a custom HTTP transfer protocol
- Uploading occurs in the background:
  - Continue working while items are uploaded
  - Pause the process and resume it later

9. EAD

- New import process and display options
- Custom metadata mapping
- Full-text searching
- Search-term highlighting within EAD
- Multiple display views
- XML web service
- Users control metadata mapping and display

10. Capacity

- Increased capacity throughout the application
- Supports more collections, more items for batch processing, and more metadata fields
- Expand metadata schemas to incorporate preservation metadata or additional custom fields
- Faster batch processing and conversion from existing databases

Contributing to Wikipedia

Wikipedia. Love it or hate it, admit it or not, lots and lots of people use it. People continue to express concerns about the accuracy and verifiability of Wikipedia’s information, and rightfully so, but library users are going there for information. Google searches (another favorite of library users) are turning up more and more Wikipedia entries. And librarians are using the site as well. One of the best descriptions of Wikipedia use came from a reference librarian: when neither the librarian nor the patron knows enough about a topic to research it, Wikipedia usually gives a number of relevant keywords and subjects that can guide further research in library resources.

Since USERS ARE GOING THERE, it’s worthwhile to provide accurate information when and where we can. Now I’m not suggesting that librarians begin poring through the website, ferreting out inaccuracies, and posting updates duly attributed to reputable, verifiable sources. That’s fine if people have the time, but most don’t. No, instead I’m talking about contributing to the wider body of knowledge through Wikipedia when and where it is appropriate.

A couple of days ago I wrote about making some updates to the web pages for our Digital Accounting Collection. As we were talking about the collection, it occurred to me that Wikipedia might be a good place to share information about it. An entry might describe some of the collections as well as give some history on the digitization project. When I came across this University of Florida entry, I was even more convinced that this was a good idea.

I created my account and started experimenting in the sandbox. I worked with a couple of colleagues to develop the entry, and I posted it today: an entry on the Digital Accounting Collection. It was an interesting process to work through. This is also the first Wikipedia entry for our library, which is a little ironic, I suppose. The Digital Accounting Collection was our first fully searchable digital collection, so I guess it’s only fitting that our first Wikipedia entry is about this collection.

So there it is. It’s out there. Since we wrote about things that we know and have worked with ourselves, the information is as accurate as it can possibly be to the best of our knowledge.

At least until somebody else edits it.  😉

Digital Accounting Collection Updates

It’s that time of year again. A colleague is getting ready for a conference presentation, and we need to do some web page updates for the Digital Accounting Collection. A number of new items have been added, and records for an entirely new collection have been created.

Several years after the fact we’re still proud of this one. This project was our library’s first fully searchable online digital collection. It was a big project for us. Perhaps making the digitized items full-text searchable was a bit ambitious for our first digitization project, but it works, and there is a lot of good content here!