Email goin’ down???

July 24, 2006

Library Stuff has a tongue-in-cheek post about how email is finally “losing” to things like social networking sites, IM, etc. It’s easy to make fun of email. It’s easy to hate email with the burning passion of a thousand suns, in fact, and most of us do.

But it’s also easy to forget that email does a few things really well. Asynchronous messages, queued up and ready to be read, labeled, sorted, and searched, with clear senders and recipients (let’s ignore spam for a moment). Email runs into problems when we try to use it as a todo list, or as semi-synchronous communication, or as a replacement for a real filing system.

The spam problem is solvable: we just haven’t solved it yet. How much IM spam do you get? None, because you only allow communication from people with whom you’ve set up an invitation. We could use whitelists on our email if we had the spine for it, but it’s a pain and doesn’t flow with the way email is currently used.
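The mechanics of a whitelist really are that simple — here’s a minimal sketch of the IM-style model (the addresses and function name are made up for illustration):

```python
# Whitelist filtering: only mail from senders you've explicitly
# approved gets delivered; everyone else bounces.
# Addresses below are made-up examples.

approved = {"alice@example.edu", "bob@example.org"}

def accept(sender: str) -> bool:
    """Deliver only if the sender is on the approved list."""
    return sender.lower() in approved

print(accept("Alice@example.edu"))    # approved correspondent
print(accept("spammer@spam.example")) # stranger, rejected
```

The hard part isn’t the code — it’s that rejecting strangers by default doesn’t fit how most people actually use email.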

I predict email will disappear right about the time the book does. 🙂

Why we’ll never be as good at search as Google

July 11, 2006 » asking the right questions, when to be simple, when to be complex

I’ll echo Dan Chudnov to make a point that I think a lot of people miss:

Maybe 20% of the collection is responsible for 80% of the use but that other 80% includes some really important stuff

When people talk about how sucky their OPAC searches are, usually they’re talking about relevance ranking. Yes, yes, there are a zillion things we can (and must) do to help our patrons search better, including spellcheck-like suggestions, better stemming, etc. I won’t argue against any of that, and in fact have argued for it fairly forcefully recently at MPOW.

But what makes Google (and their catching-up competitors) truly useful is PageRank, their authority-based relevance ranking algorithm. And PageRank is absolutely useless for libraries.

PageRank presupposes that (a) there are lots of people “voting” by making links to given resources, and (b) the best resources are the most popular/linked-to.
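Those two assumptions are exactly what the algorithm computes over: rank flows along links, so heavily linked-to pages accumulate score. A toy power-iteration sketch (the graph, damping factor, and iteration count here are invented illustrative values, not Google’s):

```python
# Toy PageRank via power iteration: each page repeatedly shares its
# rank among the pages it links to, so link "votes" drive the result.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks) if outlinks else 0.0
            for target in outlinks:
                new[target] += damping * share
        rank = new
    return rank

# Two pages "vote" for 'popular'; nothing links to 'b'.
graph = {
    "a": ["popular"],
    "b": ["popular"],
    "popular": ["obscure"],
    "obscure": ["a"],
}
ranks = pagerank(graph)
```

Run it and the linked-to page wins; the page nobody links to sinks toward the baseline. That dependence on incoming links is the whole point — and the whole problem for library collections.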

A research library doesn’t follow that model. We don’t have “voting” because, except for the most popular items, it would be worthless. And we don’t focus on popularity because the vast, vast majority of our collection is stupendously unpopular. I could take a stackful of books at random off the shelf and wander around campus all day and never find a single person who gives a rat’s ass about any of them. Any given journal article is likely to remain unread by anyone on campus forever. Forever!

The obvious candidates to drive ranking — circulation, clickthroughs, etc. — will only ever apply to such a small percentage of items that they’re basically worthless.

Relevance ranking is hard stuff. Can we do better than we are? God, I hope so. But will we ever do as well with our catalogs as Google does with popular web pages? It seems really unlikely — the problem space is just too complex.

First Mover Disadvantage

April 21, 2006

A post on standards-compliant library websites pulls together a few pleas from the sometimes-excellent Web4Lib email list.

This is one of those situations where MPOW definitely has a First Mover Disadvantage. In a lot of ways, we were cutting edge before it was cutting edge to be cutting edge, and now we pay the price. The standards weren't around, and certainly weren't attended to by anyone, when an awful lot of our content and systems were produced. So now I'm left with a gigundous website, the vast majority of which is horrible underneath, systems that produce crappy/faulty HTML, nothing that has any sort of REST or SOAP interface…it's a mess. And a mess that's so big it's terrifying to think about cleaning it all up, not to mention all the individual files.

Hmmmm…I wonder how many there are?

find . -name \*html | wc

20234 21352 840277

For those of you keeping score, that's 20,234 separate files that end in .html on our server. Oops. Even when I throw out the obvious doesn't-need-to-be-CMS'd stuff (statistics, reserves documents, cache files, newsletters, etc.) it's still over 11,000 files.

I'd love to put it all into a CMS. But that's a lot of cutting and pasting and cleaning up, even if we had a small army of people to do it, which we don't. Not all of the files are still used, either, but figuring out which ones are legacy, which ones are supposed to be accessible, which are actually used…it's easier than the conversion, but sooner or later there needs to be people making decisions. People whose job is not to sit around trying to figure out which of the pages set up by their predecessor in 1997 still need to be online.

Don't try to tell me that I can just look at the access stats. Do you really think a librarian is going to get rid of something just because it hasn't been accessed in the last year?????

And making people responsible for their own documents doesn't help, because if they knew how to produce compliant output we wouldn't be in this spot to begin with.

Someday, when we all live in a land where you can eat the rainbows and everyone owns their own pony, the big vendors will provide systems that produce good, compliant output. And I'll still be sitting on ten thousand documents that barely render correctly.