Archive for the ‘search engines’ Category

Google Goggles

Posted: December 11, 2009 in android, search engines

If you haven’t heard about Google Goggles yet, it’s worth checking out. We all do text searches, and some folks are doing voice search as well. But how about a visual search? I don’t mean searching for an image – I mean using an image as the search object. Goggles is currently available for Android phones. I’m curious to see whether Google will roll out a version for the iPhone, WebOS, or other platforms. Goggles’ potential is easy to see (no pun intended). Time will tell whether there is a demand for this type of search. The things that work are interesting enough. However, I think the things that Google says it can’t do (yet) are even more interesting!


Mark Welge

Innovative Interfaces

 

Robots can cause increased load on the public catalog.

The crawler tries to follow every link embedded in catalog pages.

The crawlers send search requests at very high volume and speed.

 

Robots exclusion protocol – depends on voluntary cooperation on the part of search engine providers.

 

Robots.txt

Read from directory above "/screens"

Publicly viewable

Might be ignored by an ill-behaved crawler

 

This file is publicly viewable, but it is not directly controllable or configurable by the library.

 

Innovative’s Strategy with Robots.txt

Allow access to mainmenu.html

Give legitimate search engines a chance to index the main page of catalog

Update robots.txt file with software releases

Extend blocking to new command links

 

http://my.library.edu/robots.txt

 

Robots.txt allows Googlebot for Google Scholar. This allows crawling of both / and /screens.
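
To make the strategy above concrete, here is a minimal sketch of how a well-behaved crawler would read such a file, using Python's standard urllib.robotparser. The rules below are my own illustration, not Innovative's actual file: the /screens/mainmenu.html location and the /search/X command link are assumed paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the spirit of the notes: Googlebot may crawl
# everything, other robots may fetch only the main menu page.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Allow: /screens/mainmenu.html
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def can_crawl(parser, agent, path):
    """Return True if `agent` is permitted to fetch `path` on the catalog host."""
    return parser.can_fetch(agent, "http://my.library.edu" + path)
```

With these rules, can_crawl(build_parser(ROBOTS_TXT), "SomeBot", "/search/X") is refused while the same request from "Googlebot" is permitted. None of this restrains a crawler that never reads the file.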

 

Recognizing a problem with crawlers

System slowness

Numerous searches submitted from an "outside" IP address

In a very short time span

In systematic patterns not typical of human users
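
The "many searches in a very short time span" symptom can be checked mechanically. The sketch below is one way to do it, assuming you already have (timestamp, IP) pairs from some log; the function name and the thresholds of 30 requests per 60 seconds are arbitrary values I picked for illustration, not anything from the presentation.

```python
from collections import defaultdict, deque


def find_bursty_ips(events, max_requests=30, window_seconds=60):
    """Flag IPs that send more than `max_requests` within `window_seconds`.

    `events` is an iterable of (timestamp_in_seconds, ip) pairs sorted by
    timestamp. Both thresholds are illustrative defaults.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in events:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged
```

A human searcher rarely trips a threshold like this, while a crawler sending one request per second does so within the first minute.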

 

Check "non-local access attempts allowed" through the character-based interface.

 

http://www.hostip.info

 

If an IP lookup on an address returns something suspicious, add an entry to the http access table and set the access value to no.

 

Usage analysis in Release 2007

Apache server

Layer in front of WebPAC

Logging of search activity

Available as a zip file 1 day later

Downloadable for analysis with 3rd-party tools

These logs are maintained for 30 days

 

Retrieval of "robots.txt" by well-behaved crawlers will be posted to this log.
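
Since well-behaved crawlers show up in the log when they retrieve robots.txt, the downloaded logs can be used to separate visitors that asked for the file from those that never did. The notes do not say what format the logs use; this sketch assumes Apache's Common Log Format, and the function name is my own.

```python
import re

# Assumes Apache Common Log Format, e.g.:
# 203.0.113.5 - - [11/Dec/2009:10:00:00 -0600] "GET /robots.txt HTTP/1.1" 200 312
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def split_by_robots_fetch(lines):
    """Return (fetchers, others): IPs that requested /robots.txt and IPs that did not."""
    seen = set()
    fetchers = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ip = match.group("ip")
        seen.add(ip)
        # Ignore any query string when comparing the requested path.
        if match.group("path").split("?")[0] == "/robots.txt":
            fetchers.add(ip)
    return fetchers, seen - fetchers
```

A high-volume address that lands in the second set is a reasonable candidate for the IP lookup described earlier.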

 

Searching for apache in the list of processes

Restart terminal menu

Show all, Limit by httpd

I came across an interesting New York Times article several days ago: Exploring a ‘Deep Web’ That Google Can’t Grasp. The article explores a shortcoming of current search technologies that librarians have known about and struggled with for quite some time. As good as current search engines may be, they rely primarily on crawlers or spiders that essentially trace a web of links to their ends. That works for a lot of content out on the Internet, but it doesn’t do so well for information contained in databases. So . . . library catalogs, digital library collections, a lot of the things that libraries do aren’t being picked up by the major search engines.
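
Here is a toy illustration of that limitation, using Python's standard html.parser. It is only a sketch of the link-following idea, not how any real search engine works, and the sample page and class name are made up. The parser can collect links to follow, but all it can do with a search form is notice that it exists.

```python
from html.parser import HTMLParser


class LinkAndFormFinder(HTMLParser):
    """Collect hyperlinks and count search forms on a page."""

    def __init__(self):
        super().__init__()
        self.links = []  # hrefs a link-following crawler could visit next
        self.forms = 0   # forms it sees but has no terms to submit to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms += 1


def survey_page(html):
    """Return (links, form_count) for a page of HTML."""
    finder = LinkAndFormFinder()
    finder.feed(html)
    return finder.links, finder.forms


# Made-up page: one ordinary link plus a database search box.
PAGE = """
<html><body>
<a href="/about.html">About the collection</a>
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
</body></html>
"""
```

Running survey_page(PAGE) finds the one link and the one form; everything behind the form stays invisible to the crawler.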

Of course at some level that makes perfect sense. When a web crawler comes to a page with a search box, how is it supposed to know what to do? It needs to input search terms to retrieve search results, but what search terms are appropriate? Is it searching an online shopping website? A tech support knowledgebase? A library catalog? This discussion surfaces again and again particularly as we talk about one of our digital collections. There is a wealth of information here for people researching the history of accounting, but it resides in a database. The database works perfectly well for humans doing a search. The only problem is that they have to find out about the database first. Now we’ve done a number of things to get the word out: papers, conference presentations, a Wikipedia article . . . If we’re lucky, these things will get users to the top level of the collection. Hopefully once they’re there, their research will draw them in. (In case anyone notices, I should get credit for positioning that set of homonyms like that!)

But getting them there in the first place – that’s the hard part. That’s why I have so much hope for deep web indexing. If researchers can build tools that will look into our databases intelligently, then extensive new levels of content will be opened up to everyone. In particular I think about students who decide that the first few search engine hits are “good enough” for their school project. Usually they’re not good enough, but the students don’t always realize that. If new search engines can truly open up the deep web, the whole playing field changes!

A few days have passed and the hype is already dying down. Time to face facts. Cuil ain’t no Dark Knight. It generated some early enthusiasm, but it hasn’t really sustained it. I keep going back to the site, but I’m just not all that impressed with Cuil. It simply isn’t finding things that appear in the first few hits of a Google search.

Stephanie has written an interesting post on Cuil over at Dube’s World. In her post she tackles one of the annoyances I’ve had with Cuil. What’s up with those random images? Cuil’s search results display is nice. I like the layout of the brief summary. If the pictures were drawn from the website, I would like those as well. Unfortunately, the pictures usually seem to have no relevance to the website with which Cuil pairs them. What’s up with that? Does anyone know how Cuil matches images to search hits?

I noted a few days ago that Cuil couldn’t even find “iron man”. At least they’ve solved THAT little problem.    😉

Cuil? Say what?

Posted: July 28, 2008 in search engines, services

So I was finally able to get into Cuil and try a few searches. I have to say that I’m not overly impressed with what I’ve seen so far, but I’m trying to stay open-minded and I’ll keep trying it. I’m not convinced that the search results are as contextually relevant as I need them to be. And on some levels, Cuil simply fails, and fails big time.

Here are a couple of sample searches.

cuil4

Now I realize that Iron Man wasn’t the biggest movie of the summer. It’s not going to give the Dark Knight a run for its money. It was reasonably popular though, and you’d think I would be able to find something. Fan sites? Perhaps an IMDB listing? Or hmmm . . . I don’t know . . . maybe the actual movie site?

This is another strange one.

cuil5

The University of Mississippi only has several thousand web pages, so it’s no wonder Cuil couldn’t find it, huh? Strangely though, if I capitalize my search terms, Cuil does return hits. I wonder if the Cuil folks realize that most people don’t capitalize things when they’re doing a search? Live and learn.

An Inauspicious Start

Posted: July 28, 2008 in search engines, services

As I was driving in to work this morning, I heard an NPR story about the new search engine, Cuil, and I was ready to give it a try. Created by former Google employees, Cuil hopes to challenge Google’s 60+% market share in part by indexing more pages than any other search engine provider.

So imagine my surprise when I went to the website and got this:

cuil3

And this:

cuil2

I like Google. I’ve used it regularly since I first heard about it, and it is my search engine of choice. However, I’m not a zealot. I’m willing to try new search engines to see if they give me better results than my fave. But if you’re going to try to take on Google, you’d better make sure your stuff works on opening day!

An inauspicious start indeed!