Why we’ll never be as good at search as Google

July 11, 2006

librarian.net » asking the right questions, when to be simple, when to be complex echoes Dan Chudnov to make a point that I think a lot of people miss:

Maybe 20% of the collection is responsible for 80% of the use but that other 80% includes some really important stuff

When people talk about how sucky their OPAC searches are, usually they’re talking about relevance ranking. Yes, yes, there are a zillion things we can (and must) do to help our patrons search better, including spellcheck-like suggestions, better stemming, etc. I won’t argue against any of that, and in fact have argued for it fairly forcefully recently at MPOW.

But what makes Google (and their catching-up competitors) truly useful is PageRank, their authority-based relevance ranking algorithm. And PageRank is absolutely useless for libraries.

PageRank presupposes that (a) there are lots of people “voting” by making links to given resources, and (b) the best resources are the most popular/linked-to.
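To make the "voting" idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is a simplified illustration, not Google's production ranking; the toy link graph and its page names are made up for the example, and the damping factor of 0.85 is just the commonly cited default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally "important".
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three blogs link to a popular guide,
# and nobody links to the obscure monograph.
toy_links = {
    "blog_a": ["popular_guide"],
    "blog_b": ["popular_guide"],
    "blog_c": ["popular_guide", "reference_site"],
    "popular_guide": ["reference_site"],
    "obscure_monograph": [],
}

ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the unlinked monograph ends up with the same baseline score as any other page nobody links to, which is roughly the situation most of a research collection is in.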

A research library doesn’t follow that model. We don’t have “voting” because, except for the most popular items, it would be worthless. And we don’t focus on popularity because the vast, vast majority of our collection is stupendously unpopular. I could take a stackful of books at random off the shelf and wander around campus all day and never find a single person who gives a rat’s ass about any of them. Any given journal article is likely to remain unread by anyone on campus forever. Forever!

The obvious candidates to drive ranking (circulation, clickthroughs, etc.) will only ever apply to such a small percentage of items that they’re basically worthless.

Relevance ranking is hard stuff. Can we do better than we are? God, I hope so. But will we ever do as well with our catalogs as Google does with popular web pages? It seems really unlikely: the problem space is just too complex.