Deep Web Indexing

I came across an interesting New York Times article several days ago: Exploring a ‘Deep Web’ That Google Can’t Grasp. The article explores a shortcoming of current search technologies that librarians have known about and struggled with for quite some time. As good as current search engines may be, they rely primarily on crawlers or spiders that essentially trace a web of links to their ends. That works for a lot of content out on the Internet, but it doesn’t do so well for information contained in databases. So . . . library catalogs, digital library collections, a lot of the things that libraries do aren’t being picked up by the major search engines.
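To make that limitation concrete, here is a minimal sketch of a link-following crawler in Python. Everything in it is an assumption for illustration: the tiny in-memory `SITE`, its paths, and the `crawl` helper are hypothetical and have nothing to do with how any real search engine or library system is built. The point is only to show that a crawler which follows links can notice a search form but has no links to follow behind it.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML. All paths are made up.
SITE = {
    "/": '<a href="/about">About</a> <a href="/collection">Collection</a>',
    "/about": '<p>About the library.</p> <a href="/">Home</a>',
    "/collection": '<form action="/search"><input name="q"></form>',
    # These pages exist only as search results; no page links to them.
    "/search?q=ledger": '<a href="/record/1">Ledger, 1905</a>',
    "/record/1": "<p>Scanned ledger page.</p>",
}


class LinkAndFormParser(HTMLParser):
    """Collects link targets and form actions from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms.append(attrs.get("action", ""))


def crawl(site, start="/"):
    """Breadth-first crawl that follows links only; forms are just noted."""
    seen = set()
    forms_found = set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in site:
            continue
        seen.add(url)
        parser = LinkAndFormParser()
        parser.feed(site[url])
        forms_found.update(parser.forms)
        queue.extend(parser.links)
    return seen, forms_found


indexed, forms = crawl(SITE)
missed = sorted(set(SITE) - indexed)
print("indexed:", sorted(indexed))
print("search forms seen but not queried:", sorted(forms))
print("never reached:", missed)
```

In this toy setup the crawler indexes the home page, the about page, and the collection's search page, and it records that a search form exists. The record page and the search results that lead to it are never reached, because nothing links to them until someone types a query.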

Of course, at some level that makes perfect sense. When a web crawler comes to a page with a search box, how is it supposed to know what to do? It needs to enter search terms to retrieve results, but which terms are appropriate? Is it searching an online shopping website? A tech support knowledge base? A library catalog? This discussion surfaces again and again, particularly when we talk about one of our digital collections. There is a wealth of information here for people researching the history of accounting, but it resides in a database. The database works perfectly well for humans doing a search. The only problem is that they have to find out about the database first. Now, we’ve done a number of things to get the word out: papers, conference presentations, a Wikipedia article . . . If we’re lucky, these things will get users to the top level of the collection. Hopefully once they’re there, their research will draw them in. (In case anyone notices, I should get credit for positioning that set of homonyms like that!)

But getting them there in the first place – that’s the hard part. That’s why I have so much hope for deep web indexing. If researchers can build tools that look into our databases intelligently, then extensive new levels of content will be opened up to everyone. In particular, I think about students who decide that the first few search engine hits are “good enough” for their school project. Usually they’re not good enough, but the students don’t always realize that. If new search engines can truly open up the deep web, the whole playing field changes!