If you haven’t heard about Google Goggles yet, it’s worth checking out. We all do text searches, and some folks are doing voice search as well. But how about a visual search? I don’t mean searching for an image – I mean using an image as the search object. Goggles is currently available for Android phones. I’m curious to see whether Google will roll out a version for the iPhone, WebOS, or other platforms. Goggles’ potential is easy to see (no pun intended). Time will tell whether there is a demand for this type of search. The things that work are interesting enough. However, I think the things that Google says it can’t do (yet) are even more interesting!
Archive for the ‘search engines’ Category
IUG 2009 – Robots-Crawlers-Spiders – Automated Searches and Your WebPAC
Posted: May 20, 2009 in conferences, Innovative Interfaces, Innovative Users Group, integrated library system, IUG, iug 2009, library catalog, search engines
Tags: ils, iug 2009, libraries
Mark Welge
Innovative Interfaces
Robots can cause increased load on the public catalog.
The crawler tries to follow every link embedded in catalog pages.
The crawlers send search requests at very high volume and speed.
Robots exclusion protocol – depends on voluntary cooperation on the part of search engine providers.
Robots.txt
Read from directory above "/screens"
Publicly viewable
Might be ignored by an ill-behaved crawler
This file is publicly viewable, but it is not directly controllable or configurable by the library.
Innovative’s Strategy with Robots.txt
Allow access to mainmenu.html
Give legitimate search engines a chance to index the main page of catalog
Update robots.txt file with software releases
Extend blocking to new command links
http://my.library.edu/robots.txt
Robots.txt allows Googlebot for Google Scholar. This allows crawling of both / and /screens.
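To make the robots.txt notes a little more concrete, here is a minimal sketch in Python using the standard library's urllib.robotparser. The rules below are my own hypothetical example, not Innovative's actual file, and the /search/example path is invented for illustration. It only shows how a cooperating crawler would interpret rules that open up the main menu page and give Googlebot wider access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the file Innovative ships.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /
Allow: /screens

User-agent: *
Allow: /screens/mainmenu.html
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://my.library.edu"


def allowed(agent, path):
    """Ask whether a crawler that honors robots.txt may fetch this path."""
    return parser.can_fetch(agent, BASE + path)


# "/search/example" is an assumed catalog search path, not a real WebPAC URL.
for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/screens/mainmenu.html", "/search/example"):
        print(agent, path, allowed(agent, path))
```

Keep in mind that this is exactly the voluntary part: the parser tells a polite crawler what it should do, but nothing here stops an ill-behaved one.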
Recognizing a problem with crawlers
System slowness
Numerous searches submitted from an "outside" IP address
In a very short time span
In systematic patterns not typical of human users
Check "non-local access attempts allowed" through the character-based interface.
If an IP lookup on an address returns something suspicious, add an entry to the http access table and set the access value to no.
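The warning signs above (many searches from one outside address in a very short time span) can be roughed out in code. This is only a sketch under my own assumptions: the function name, the 10-second window, the threshold of 20 requests, and the sample addresses are all hypothetical, and the real check on an Innovative system happens through the character-based interface, not a script like this.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def flag_bursty_ips(requests, window=timedelta(seconds=10), threshold=20):
    """Return the IPs that sent more than `threshold` requests within any `window`.

    `requests` is an iterable of (ip, timestamp) pairs. The defaults are
    illustrative guesses, not values recommended by Innovative.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ip, ts in sorted(requests, key=lambda r: r[1]):
        q = recent[ip]
        q.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while ts - q[0] > window:
            q.popleft()
        if len(q) > threshold:
            flagged.add(ip)
    return flagged


# Made-up sample data: one address hammering the catalog, one browsing slowly.
start = datetime(2009, 5, 20, 9, 0, 0)
crawler = [("203.0.113.7", start + timedelta(milliseconds=100 * i)) for i in range(30)]
human = [("192.0.2.10", start + timedelta(minutes=i)) for i in range(5)]

suspects = flag_bursty_ips(crawler + human)
print(suspects)
```

An address flagged this way is only a candidate. The IP lookup step still decides whether it belongs in the http access table.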
Usage analysis in Release 2007
Apache server
Layer in front of WebPAC
Logging of search activity
Available as a zip file 1 day later
Downloadable for analysis with 3rd-party tools
These logs are maintained for 30 days
Retrieval of "robots.txt" by well-behaved crawlers will be posted to this log.
Searching for apache in the list of processes
Restart terminal menu
Show all, Limit by httpd
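Since the usage logs can be downloaded for analysis with 3rd-party tools, here is a small sketch of one thing you might look for: which addresses retrieved robots.txt and which never did. The session did not specify the log layout, so this assumes Apache's Common Log Format. The sample lines, addresses, and paths are invented.

```python
import re

# Assumes Common Log Format; the actual layout of the WebPAC logs may differ.
CLF = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def split_by_robots_fetch(lines):
    """Return (ips_that_fetched_robots_txt, ips_that_did_not)."""
    seen, fetched = set(), set()
    for line in lines:
        match = CLF.match(line)
        if not match:
            continue  # skip lines that do not look like Common Log Format
        ip = match.group("ip")
        seen.add(ip)
        if match.group("path").split("?")[0] == "/robots.txt":
            fetched.add(ip)
    return fetched, seen - fetched


# Invented sample lines for illustration.
SAMPLE = [
    '203.0.113.7 - - [20/May/2009:09:00:01 -0500] "GET /robots.txt HTTP/1.1" 200 512',
    '203.0.113.7 - - [20/May/2009:09:00:02 -0500] "GET /screens/mainmenu.html HTTP/1.1" 200 2048',
    '198.51.100.23 - - [20/May/2009:09:00:03 -0500] "GET /search/example HTTP/1.1" 200 4096',
]

polite, others = split_by_robots_fetch(SAMPLE)
print("fetched robots.txt:", polite)
print("never fetched it:", others)
```

Not fetching robots.txt is normal for ordinary browsers, so the second group only becomes interesting when combined with high request volume.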
Deep Web Indexing
Posted: March 14, 2009 in digital libraries, digitization, internet, search engines
Tags: digital libraries, digitization, internet, libraries
I came across an interesting New York Times article several days ago: Exploring a ‘Deep Web’ That Google Can’t Grasp. The article explores a shortcoming of current search technologies that librarians have known about and struggled with for quite some time. As good as current search engines may be, they rely primarily on crawlers or spiders that essentially trace a web of links to their ends. That works for a lot of content out on the Internet, but it doesn’t do so well for information contained in databases. So . . . library catalogs, digital library collections, a lot of the things that libraries do aren’t being picked up by the major search engines.
Of course at some level that makes perfect sense. When a web crawler comes to a page with a search box, how is it supposed to know what to do? It needs to input search terms to retrieve search results, but what search terms are appropriate? Is it searching an online shopping website? A tech support knowledgebase? A library catalog? This discussion surfaces again and again particularly as we talk about one of our digital collections. There is a wealth of information here for people researching the history of accounting, but it resides in a database. The database works perfectly well for humans doing a search. The only problem is that they have to find out about the database first. Now we’ve done a number of things to get the word out: papers, conference presentations, a Wikipedia article . . . If we’re lucky, these things will get users to the top level of the collection. Hopefully once they’re there, their research will draw them in. (In case anyone notices, I should get credit for positioning that set of homonyms like that!)
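For the curious, here is a tiny Python sketch of the crawler's dilemma. It is a toy under my own assumptions (the page, class name, and paths are made up), not how any real search engine works. A link-following crawler can collect every href on a page, but when it meets a search form, all it learns is where the form submits to, not what to type into it.

```python
from html.parser import HTMLParser


class LinkAndFormFinder(HTMLParser):
    """Collect link targets and form actions from a page (toy example)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms.append(attrs.get("action", ""))


# A made-up page: one ordinary link plus a search box in front of a database.
PAGE = """
<html><body>
<a href="/about.html">About the collection</a>
<form action="/search" method="get">
<input type="text" name="q">
<input type="submit" value="Search">
</form>
</body></html>
"""

finder = LinkAndFormFinder()
finder.feed(PAGE)
print("links the crawler can follow:", finder.links)
print("forms it cannot fill in on its own:", finder.forms)
```

Everything behind that form stays invisible unless something supplies sensible search terms, which is the gap deep web indexing is trying to close.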
But getting them there in the first place – that’s the hard part. That’s why I have so much hope for deep web indexing. If researchers can build tools that will look into our databases intelligently, then extensive new levels of content will be opened up to everyone. In particular I think about students who decide that the first few search engine hits are “good enough” for their school project. Usually they’re not good enough, but the students don’t always realize that. If new search engines can truly open up the deep web, the whole playing field changes!
A few days have passed and the hype is already dying down. Time to face facts. Cuil ain’t no Dark Knight. It generated some early enthusiasm, but it hasn’t really sustained it. I keep going back to the site, but I’m just not all that impressed with Cuil. It simply isn’t finding things that appear in the first few hits of a Google search.
Stephanie has written an interesting post on Cuil over at Dube’s World. In her post she tackles one of the annoyances I’ve had with Cuil. What’s up with those random images? Cuil’s search results display is nice. I like the layout of the brief summary. If the pictures were drawn from the website, I would like those as well. Unfortunately, the pictures usually seem to have no relevance to the website with which Cuil pairs them. What’s up with that? Does anyone know how Cuil matches images to search hits?
I noted a few days ago that Cuil couldn’t even find “iron man”. At least they’ve solved THAT little problem.
So I was finally able to get into Cuil and try a few searches. I have to say that I’m not overly impressed with what I’ve seen so far, but I’m trying to stay open-minded and I’ll keep trying it. I’m not convinced that the search results are as contextually relevant as I need them to be. And on some levels, Cuil simply fails, and fails bigtime.
Here are a couple of sample searches.
Now I realize that Iron Man wasn’t the biggest movie of the summer. It’s not going to give the Dark Knight a run for its money. It was reasonably popular though, and you’d think I would be able to find something. Fan sites? Perhaps an IMDB listing? Or hmmm . . . I don’t know . . . maybe the actual movie site?
This is another strange one.
The University of Mississippi only has several thousand web pages, so it’s no wonder Cuil couldn’t find it, huh? Strangely though, if I capitalize my search terms, Cuil does return hits. I wonder if the Cuil folks realize that most people don’t capitalize things when they’re doing a search? Live and learn.
As I was driving in to work this morning, I heard an NPR story about the new search engine, Cuil, and I was ready to give it a try. Created by former Google employees, Cuil hopes to challenge Google’s 60+% market share dominance in part by indexing more pages than any other search engine provider.
So imagine my surprise when I went to the website and got this:
And this:
I like Google. I’ve used it regularly since I first heard about it, and it is my search engine of choice. However, I’m not a zealot. I’m willing to try new search engines to see if they give me better results than my fave. But if you’re going to try to take on Google, you’d better make sure your stuff works on opening day!
An inauspicious start indeed!
