Why we’ll never be as good at search as Google

The librarian.net post “asking the right questions, when to be simple, when to be complex” echoes Dan Chudnov to make a point that I think a lot of people miss:

Maybe 20% of the collection is responsible for 80% of the use but that other 80% includes some really important stuff

When people talk about how sucky their OPAC searches are, usually they’re talking about relevance ranking. Yes, yes, there are a zillion things we can (and must) do to help our patrons search better, including spellcheck-like suggestions, better stemming, etc. I won’t argue against any of that, and in fact have argued for it fairly forcefully recently at MPOW.

But what makes Google (and their catching-up competitors) truly useful is PageRank, their authority-based relevancy ranking algorithm. And PageRank is absolutely useless for libraries.

PageRank presupposes that (a) there are lots of people “voting” by making links to given resources, and (b) the best resources are the most popular/linked-to.
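
For anyone who hasn’t seen it spelled out, PageRank is at heart a link-counting recursion: a page’s score is the sum of the score flowing in from the pages that link to it, plus a small “teleport” baseline so the math converges. Here’s a toy sketch with invented pages (not a real crawl, and obviously not Google’s production code) showing why an item nobody links to never rises much above that baseline:

```python
# Minimal PageRank power iteration over a tiny, made-up link graph.
# A node with no inbound links ends up with (essentially) only the
# baseline "teleport" score -- the crux of the problem for the vast
# majority of library items, which nobody "votes" for at all.

links = {
    "popular_page": ["niche_page_a"],                  # who each page links TO
    "hub_page":     ["popular_page", "niche_page_a"],
    "niche_page_a": ["popular_page"],
    "niche_page_b": [],                                # nobody links here; it links nowhere
}

damping = 0.85                       # standard damping factor from the PageRank paper
n = len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):                  # power iteration until scores settle
    new_rank = {page: (1.0 - damping) / n for page in links}
    for page, outlinks in links.items():
        if outlinks:
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        else:
            # Dangling node: spread its rank evenly over everything.
            for target in new_rank:
                new_rank[target] += damping * rank[page] / n
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page:15s} {score:.3f}")
```
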

A research library doesn’t follow that model. We don’t have “voting” because, except for the most popular items, it would be worthless. And we don’t focus on popularity because the vast, vast majority of our collection is stupendously unpopular. I could take a stackful of books at random off the shelf and wander around campus all day and never find a single person who gives a rat’s ass about any of them. Any given journal article is likely to remain unread by anyone on campus forever. Forever!

The obvious candidates to drive ranking — circulation, clickthroughs, etc. — will only ever apply to such a small percentage of items that they’re basically worthless.

Relevance ranking is hard stuff. Can we do better than we are? God, I hope so. But will we ever do as well with our catalogs as Google does with popular web pages? It seems really unlikely — the problem space is just too complex.

2 Responses to Why we’ll never be as good at search as Google

  1. […] OPAChy reminds us in a post entitled Why we’ll never be as good at search as Google that the major problem in using Google’s page-ranking algorithms to compare with OPAC search ranking is that their page-ranking system is not necessarily applicable to our data. OPAChy writes that “PageRank presupposes that (a) there are lots of people “voting” by making links to given resources, and (b) the best resources are the most popular/linked-to.” The fact that the majority of our collection is not heavily used makes the use of relevancy ranking much more complicated. Just because an item hasn’t ever been used doesn’t mean that it isn’t highly relevant to a specific research topic. How does one convey this in terms of relevance? Obviously, academic, public and special libraries may have extremely different needs in terms of relevancy ranking. Since much of the recent criticism of OPACs relates to lack of relevancy ranking, I think we need to look closer at this issue in order to determine what we need for relevancy ranking. Our collections are very different from Google and even from Amazon and sometimes I think we forget this fact. Ultimately, I think this relates to the fact that Google provides users with many sources for a given topic, but that libraries are trying not to provide just sources, but the best sources available for a given topic. This is a critical distinction. […]

  2. Scott says:

    But there are some things we could do to improve relevance. As one example… certain publishers/imprints tend to publish certain types of materials (DAW Books, as an obvious example, tends to publish speculative fiction). Why not create a database that could adjust relevance weights based on such data? If someone is looking for material on Buddhism, even if a DAW book had “Buddha” in the title it might be reasonable to weigh its relevance less than a book published by, say, Shamballah Press.

    It’s not the same kind of relevance as page rank/popularity, but it could certainly bring some added value to searches.
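
A rough sketch of the kind of adjustment Scott describes above, with invented publisher weights and a stand-in for whatever keyword score the catalog already computes (none of this comes from a real OPAC):

```python
# Illustrative only: boost or dampen a text-relevance score based on how
# well a publisher/imprint matches the subject being searched.
# The weights, records, and the baseline scorer are all invented for this sketch.

# How strongly each imprint is associated with a broad subject area.
PUBLISHER_SUBJECT_WEIGHTS = {
    ("DAW Books", "religion"):       0.5,   # mostly speculative fiction
    ("Shamballah Press", "religion"): 1.5,  # strong religion/Buddhism list
}

def baseline_score(record: dict, query: str) -> float:
    """Stand-in for the catalog's existing keyword relevance:
    here, just the fraction of query words found in the title."""
    title_words = set(record["title"].lower().split())
    query_words = query.lower().split()
    hits = sum(1 for w in query_words if w in title_words)
    return hits / len(query_words) if query_words else 0.0

def adjusted_score(record: dict, query: str, subject: str) -> float:
    # Default weight of 1.0 when we know nothing about the publisher.
    weight = PUBLISHER_SUBJECT_WEIGHTS.get((record["publisher"], subject), 1.0)
    return baseline_score(record, query) * weight

records = [
    {"title": "Buddha on the Road",      "publisher": "DAW Books"},
    {"title": "The Heart of the Buddha", "publisher": "Shamballah Press"},
]

for r in records:
    print(r["title"], round(adjusted_score(r, "buddha", "religion"), 2))
```
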

Leave a comment