Deep Web Indexing

Posted: March 14, 2009 in digital libraries, digitization, internet, search engines

I came across an interesting New York Times article several days ago: Exploring a ‘Deep Web’ That Google Can’t Grasp. The article explores a shortcoming of current search technologies that librarians have known about and struggled with for quite some time. As good as current search engines may be, they rely primarily on crawlers or spiders that essentially trace a web of links to their ends. That works for a lot of content out on the Internet, but it doesn’t do so well for information contained in databases. So . . . library catalogs, digital library collections, a lot of the things that libraries do aren’t being picked up by the major search engines.
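To make the limitation concrete, here is a minimal sketch of the link-following crawl described above, using only the Python standard library. The `SITE` dict is a stand-in for real HTTP fetching, and the page paths are invented for illustration: the crawler reaches every page connected by `<a href>` links, but a catalog record that can only be reached by submitting a search form is never visited, because no link points to it.

```python
from collections import deque
from html.parser import HTMLParser

# A toy "website": paths mapped to HTML. The last entry is deep-web
# content -- it sits behind a search form, with no inbound links.
SITE = {
    "/": '<a href="/about">About</a> <form action="/search"></form>',
    "/about": '<a href="/">Home</a>',
    "/record?id=42": "A catalog record hidden in the database.",
}

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags; ignores forms entirely,
    just as a simple crawler does."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(start):
    """Breadth-first traversal: visit every page reachable from
    `start` by following links, and only by following links."""
    seen, queue = {start}, deque([start])
    while queue:
        parser = LinkExtractor()
        parser.feed(SITE.get(queue.popleft(), ""))
        for link in parser.links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

visited = crawl("/")
# The form-backed record never appears in `visited`:
# the crawler saw the <form> but has no idea what to type into it.
```

Running `crawl("/")` returns only `{"/", "/about"}`; the record page is invisible to the crawler even though a human who found the search box could retrieve it in seconds. That gap is exactly the deep web.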

Of course, at some level that makes perfect sense. When a web crawler comes to a page with a search box, how is it supposed to know what to do? It needs to input search terms to retrieve search results, but what search terms are appropriate? Is it searching an online shopping website? A tech support knowledge base? A library catalog? This discussion surfaces again and again, particularly as we talk about one of our digital collections. There is a wealth of information here for people researching the history of accounting, but it resides in a database. The database works perfectly well for humans doing a search. The only problem is that they have to find out about the database first. Now we’ve done a number of things to get the word out: papers, conference presentations, a Wikipedia article . . . If we’re lucky, these things will get users to the top level of the collection. Hopefully once they’re there, their research will draw them in. (In case anyone notices, I should get credit for positioning that set of homonyms like that!)

But getting them there in the first place – that’s the hard part. That’s why I have so much hope for deep web indexing. If researchers can build tools that will look into our databases intelligently, then extensive new levels of content will be opened up to everyone. In particular I think about students who decide that the first few search engine hits are “good enough” for their school project. Usually they’re not good enough, but the students don’t always realize that. If new search engines can truly open up the deep web, the whole playing field changes!

Comments
  1. Eternity says:

    I agree, modern search engines are not efficient enough to pinpoint the content we are actually looking for.

  2. Hi, thanks for delving into the deep web topic.

    Library blogs have covered ISEN for some time, and librarians have indicated some real anticipation for the technology we have patent pending. We are in the development stage now, so expect something this year or next. We’re confident we have the solution, which combines AI and human intelligence to catalog the vast numbers of databases not indexed by the conventional search industry.

    Feel free to ask away about what we’re up to…

    -Matt

  3. Most of the current approaches to deep web indexing seem to ignore the contextual parts of the deep web. The deep web has a context, and it’s necessary to understand the context of a database before searching it. Additionally, a crawl-and-index approach means that a valuable property of the deep web (the fact that its databases are spam-free and typically hold scholarly content) is lost.
